I don't want to cast doubt on the professional qualities of Google engineers, but could someone explain to me why it's so complicated to remove the GIL in Unladen Swallow? As far as I remember, UnSw targets LLVM, which is an advanced VM with JIT compilation (i.e. not fully interpreted). Jython does nearly the same (I'm not a Jython developer, so I may not know the details), but it targets the JVM, and as a result Jython has no GIL. What's the key difference between LLVM and the JVM in this regard?
One other potential difficulty is that Jython doesn't aim to be 100% compatible with existing C extensions. Unladen Swallow does. Plus, I believe Unladen Swallow is essentially a branch of the CPython source code, while Jython was written from scratch.
Well, the JVM is a high quality, mature, very fast, cross platform runtime. However, I'm always slightly concerned that letting the JVM in through the front door might open the back door to the kind of SOP (Soviet Oriented Programming) exemplified by Java EE. JSR hell is lurking behind the most innocent-looking serverside library :-)
Java suffers from JSR approval hell simply because it tries to standardize everything. That's a requirement for doing well in the enterprise world, along with the programmer certification mechanism and the inclusion of the top industry players in the standards approval process.
If your language is just a hackers' toy (Clojure and Scala are exactly that), then why do you care about JSRs in the first place? The JSRs usually don't mention the JVM, only the libraries or Java language extensions.
I agree with you that it should be possible to avoid much of the complexity of Java EE if you just use the JVM.
But one motivation for using a JVM based language in the first place is that so many libraries are available on that platform. Unfortunately, many of them suffer from excessive complexity and some of that complexity is due to the JCP.
The Ruby community has its own traits. A "gem for Java" probably doesn't exist because not many people need it (otherwise it would have been written already). In pure Java it's hard to use a REPL, and you have to build your project before running it anyway, so Maven is a better fit there.
Removing the GIL is easy if you drop the compatibility requirements with C extensions. A lot of complex C extensions rely on CPython's exact reference-counting semantics. That's why Unladen Swallow is interesting in the first place.
How about teaching programmers to program without using threads?
edit: sure downmod me. It's crazy talk! How could programmers do without threads and concurrency issues and all of the other blocking problems. Hardware should handle multiple cores. Not programmers.
There is probably more traction in what you are saying than some of the commenters suggest. For compute-bound tasks, it is often productive to split computations into long-running processes.
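To make that concrete, here's a minimal sketch with the stdlib multiprocessing module (the work function and chunk sizes are made up purely for illustration):

    from multiprocessing import Pool

    def crunch(chunk):
        # stand-in for some CPU-heavy computation on one slice of the input
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        chunks = [range(i * 1000000, (i + 1) * 1000000) for i in range(8)]
        pool = Pool()                      # one worker process per core by default
        totals = pool.map(crunch, chunks)  # each chunk runs in its own process
        pool.close()
        pool.join()
        print(sum(totals))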
There are two reasons, I think, that the discussion goes beyond what you suggest. One is detailed in http://www.tbray.org/ongoing/When/200x/2009/09/27/Concur-dot... where the Wide Finder project explores relatively easy ways to split a log-searching task into effective threads on a multi-core machine.
This is really a hard problem, as evidenced by Tim's long series of articles detailing various forays into Clojure and other languages.
Your comment "Hardware should handle multiple cores" reflects the opposite of what I think chip manufacturers are thinking. They run into the performance barrier, so they build a chip with more CPUs on it and hand the problem off to the compiler team and the rest of the software world.
I would take it another step further in challenging hardware manufacturers to look at the broader problem. There was an article recently that noticed that for Lisp, the effective performance gain over a decade or two was a factor of about 50, whereas for C-family programs it was several orders of magnitude. To me this implies that hardware isn't going in a direction that supports higher-level computing.
Remember when the 360 instruction set came out? The 7094 people looked at it with some sense of disappointment. And where are the nice instruction sets exemplified by the PDP-10 and its family?
Perhaps this implies smarter cores, so that we don't need so many of them.
But in today's world, it seems that the languages that work well with multiple threads have a language construct that is required to make it work--libraries don't do the trick. The clean channels of Go and the concurrency constructs in Clojure point the way. Maybe the GIL-fix approach is truly doomed.
I think you're being downvoted because people disagree with what you're saying. In my experience you need multiple threads when you want multiple things to happen at the same time. E.g. if I have a client/server architecture and one client instructs the server to perform a long-running task, I don't want the server to appear frozen to all my other clients, which it would if the server ran in a single thread. I don't really see how you can get around this. Do you have a solution?
>> "you need multiple threads when you want multiple things to happen at the same time"
Computers don't work that way. Unless you have many CPUs, nothing happens at the same time.
Rewrite your 'long running task' to do things bit by bit.
By effectively doing your own timeslicing, you remove the need for any locking or concurrency issues. Once you get into the habit of programming like this, you wouldn't believe how much easier things are.
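A minimal sketch of what I mean, in Python (the "other work" hook is imaginary; generators are just one convenient way to slice a task up):

    def long_running_task(n):
        total = 0
        for i in range(n):
            total += i * i
            if i % 10000 == 0:
                yield                  # hand control back to the main loop
        print("task done:", total)

    def do_other_work():
        pass                           # stand-in for serving the other clients

    tasks = [long_running_task(1000000)]
    while tasks:
        for task in list(tasks):
            try:
                next(task)             # run one slice of this task
            except StopIteration:
                tasks.remove(task)
        do_other_work()                # nothing was blocked while the slice ran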
FWIW this is how Mibbit backend works - thousands of connections handled in a single thread.
JavaScript doesn't have threads (thankfully). There is no need for threads. They look like magic, but they cause more issues than they solve IMHO. The mapping of 'work' onto physical CPUs should be done silently by the hardware, IMHO (if you have more than one CPU).
Windows 3 and MacOS classic apps used to get wedged every so often because cooperative multitasking is so easy to get slightly wrong.
Writing everything in continuation-passing style is like writing your own filesystem or parser state machine. It's a little more efficient per core, it's complicated enough that the challenge of getting it right might be gratifying, but it's rarely the best use of our time when we're surrounded with ridiculously powerful hardware. Even netbooks have multiple cores now.
> Windows 3 and MacOS classic apps used to get wedged every so often because cooperative multitasking is so easy to get slightly wrong
This is a different issue: multitasking of processes on the same CPU. This is an area that needs to be handled by the OS, because you can't guarantee that other processes will be cooperative.
Inside your own process, however, you can do whatever you want by dividing tasks into small chunks, without the need for threads.
The only place where I think we need threads is when dealing with libraries that we can't control. For example, UI libraries, networking libraries, math libraries, etc.
I think his point is that when you "divide tasks in small chunks", you are no longer writing code in the way that is most convenient for you but in a way that gets better performance from a simple compiler. It means more code spread over more functions, which increases the probability of bugs.
I don't think it's a problem that needs to be solved for most applications. Certainly for average websites/webapps, they're just passing data around. CPU usage shouldn't be high at all unless something really computationally expensive needs to go on like speech recognition or video encoding or something.
I'll let the scientists who actually need CPU power figure that one out.
Most of the websites people here are working on could be run on a 386 and still have cycles to spare.
I think the idea that every piece of software will run in parallel in the future is nonsense. Hardware vendors are just trying to create a need where there is none.
Clearly, someone will find applications where this power is needed (graphics, simulation, robotics), but there is no way that MS Word will run in parallel in more than a few processors. The biggest change in multicore is in enabling new applications, not in changing the way current applications are developed.
I did not say every piece of software will run in parallel, or that it should. The subject is the runtime environment of a programming language. If Python is going to be used in these new applications you brought up, it would help if they were able to remove the GIL.
I've heard that argument since we started having dual CPUs 5-10 years ago. I don't really buy it. Netbooks are so popular mainly because we don't need so much CPU power on our local thin clients to the web.
This kind of argument sounds so familiar. Decades ago, when people worked on supercomputers, hundreds of millions of dollars were thrown at "parallel compilers" and "shared-memory machines" which aimed to reduce the complexity of parallel computing for programmers. It just didn't work. If a programmer isn't aware of the underlying architecture of the parallel machine, performance suffers heavily. That's why we have threads, message passing, and NUMA today.
For one, languages like Erlang, Haskell and Clojure already make concurrent programming reasonably approachable, and second, if people were to switch to proper dataflow languages* then parallelization would be implicit and done automagically for the programmer.
* What I really want is a proper dataflow/imperative hybrid that lets me choose the right tool for the job...
I appreciate that nothing happens at the same time on a single core but CPU time is shared between threads so it 'appears' as if more than one thing happens at once. A good example of this is a web browser on a single core machine - the browser does not freeze up while it is downloading data. That is because the CPU time is shared between the UI thread and the other worker threads.
An alternative (better) model would be to simply have a single thread with a main loop: async networking, plus periodic UI updates, all in the same thread.
    while (true) {
        networking.check();  // Check if any sockets are ready for read/write/connect
        ui.update();         // Update the UI a bit if needed
    }
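Roughly the same shape in Python, using the stdlib select module for the async networking (just a sketch; update_ui() is a stand-in for whatever periodic work you need, and a real loop would also buffer writes):

    import select, socket

    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("", 8000))
    server.listen(64)
    server.setblocking(False)
    sockets = [server]

    def update_ui():
        pass                                   # stand-in for periodic UI work

    while True:
        readable, _, _ = select.select(sockets, [], [], 0.05)   # 50 ms tick
        for sock in readable:
            if sock is server:
                conn, _ = sock.accept()        # new client connection
                conn.setblocking(False)
                sockets.append(conn)
            else:
                data = sock.recv(4096)
                if data:
                    sock.send(data)            # echo back (unbuffered, sketch only)
                else:
                    sockets.remove(sock)       # client went away
                    sock.close()
        update_ui()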
The only case this would be a terrible idea is if you don't have control of all the code, or need to interface to things that may block/crash/etc.
No modern web browser controls all of the code, since it must execute arbitrary JavaScript - which can block and crash.
Everyone doing something doesn't make them right. But when all major instances of an application are implemented differently than you think is best, perhaps you don't understand the problem as well as you think you do. Chrome, I think, is the best browser architecture, and it looks like IE and Firefox will adopt something similar. I think they use separate processes to manage tabs instead of threads, but it's still parallel.
Execute a single JavaScript instruction at a time? I doubt the performance of that would be acceptable. But if you're aware of any browsers doing that, I'd like to know.
If I were to write a browser right now, it's how I'd do it. You would more likely execute a few js instructions per loop, depending on what else you have to do in that loop also - network check, ui update, etc.
Why would there be a performance hit in doing js instructions one by one though ;) A loop isn't expensive.
A loop isn't expensive, but swapping your interpreter's code out of registers and CPU cache for each JS instruction may be. Anyway, this method doesn't really help you JIT-compile your JS either...
Code is code. If you can explain that a bit more I'd be interested.
Obviously you wouldn't update the UI every time you execute a js instruction. That would be insane. I just put 1 js instruction in the loop to have the minimum unit, in case anything else needs to be updated very quickly at the same time - eg some animation etc
CPUs (or cores in a CPU) have a pretty small instruction cache. On a Core2, I believe it's 32K cache per core. It's somewhat expensive (slow) to fetch instructions from RAM to L1, but once code is in L1, reading from it is immediate, or close to it. If you have a two-core machine where one core can do most of your js dispatch logic, and the other core can do the inner loop of your rendering logic, then you can have much better performance than constantly swapping out your logic on the cache of a single core.
I'm really not sure how useful it is to cache-optimise a browser. You need to be able to fit a significant amount of logic into 32K in order to take advantage of cache. When I used to do realtime image processing (AR), I could get an order of magnitude speedup by just getting my logic to fit into 32K; image processing code can literally go from 3fps to 30fps just by tightening code to the point where it can fit in L1 cache. I don't know if it's possible to fit significant amounts of rendering logic or js logic into 32K, but if it is, then dedicating a core to each of those functions could give a significant speedup.
The JavaScript VM will have a significant amount of state associated with it. Executing a virtual instruction will require accessing that state. If that data is not in the CPU's cache, it will cause cache misses, which stall code progression.
If you then use that data in the cache for a while, then the cost of the cache miss will be amortized. But what you're proposing is going back and forth quickly between the JavaScript VM and the rest of the browser code. The browser code will also need to bring its data into the cache, which will kick out the JavaScript VM's data.
Since you're proposing that the JavaScript VM should do a very small amount of work at each time, and it will likely need to bring all of its data back into the cache each time, you will see a lot of CPU stalls.
I can't reply directly to your comment, so I'll answer here. Pretty much everything I do at work involves communicating with resources which may block or crash, so it wouldn't really be practical to put everything in a single while loop. It sounds a little bit to me like you're reinventing the wheel: multi-threading means you don't have to go to the pains of breaking your long-running tasks up into pieces; the infrastructure takes care of that for you. I'm not trying to convert you; it sounds like you're perfectly happy and successful working in a single-threaded environment. It just seems like you're doing quite a lot of work to avoid multiple threads, which really aren't that difficult to manage.
One of the big pains of my work day is our accounting system which, while being great at what it does well, can be abysmally slow to respond to queries (the simple question of how much of X do we own now, takes it around three minutes to answer...). We know from experience though that internally it can deal with up to three requests at a time without it slowing down. If I only had one thread then three requests (which block) would take me nine minutes to process, with three threads I can get all the results back in three minutes.
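For what it's worth, the threaded version of that is only a few lines (a sketch; the slow accounting call is simulated with sleep):

    import threading, time

    def slow_query(symbol, results):
        time.sleep(3)                    # stand-in for the ~3 minute accounting query
        results[symbol] = "position for " + symbol

    results = {}
    threads = [threading.Thread(target=slow_query, args=(s, results))
               for s in ("X", "Y", "Z")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                         # total wall time ~= one query, not three
    print(results)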
Unfortunately that's not practical, it's third party software and at least the last time we checked there wasn't a better alternative with the same functionality.
Not always true. There are multiple levels of parallelism in modern CPUs, and cores are only one of them, e.g. instruction-level parallelism, hyperthreading, etc.
Anyway, there's more to concurrent programming than doing multiple things at once in parallel - there's also the use of threads to turn a synchronous blocking call into an asynchronous one, like you said.
> By effectively doing your own timeslicing, you remove the need for any locking or concurrency issues
Native threads do this timeslicing for you; that is why they were invented (so that programmers didn't have to mess around with crazy micro-management of what they were doing).
Can you imagine writing Chrome or Firefox using a single thread (it would be hard, and probably unusable)? :)
> Native threads do this timeslicing for you; that is why they were invented
But they don't solve the fundamental problem of how to organize your code to take advantage of this. It is really easy to create a threaded application that has lower performance than a single-threaded application that divides work into chunks. And, after all the research in this area, there is no clear way to use threads without spending a lot of time making sure they work.
At least on Windows they are both multi-threaded. If you look in Task Manager then firefox.exe will have multiple threads (32 on my machine) and each chrome.exe process will have multiple threads (anywhere between 3 and 21 on the 15 tabs I have open).
> FWIW this is how Mibbit backend works - thousands of connections handled in a single thread.
You might as well not be running an OS and have Mibbit run straight on the hardware. All it's doing is handling network connections and some other data processing right? And don't look at me like I'm crazy, your OS is using valuable CPU time!
The problem is that you're also preventing yourself from being able to take advantage of multiple cores. Of course, it would be nice if hardware would divide those tasks up between processors automatically. But then they'd be threads.
It's not so much that I disagree with him/her. In fact, I would love it if we could do away with memory sharing. It's just that it seems a bit utopian (especially the argument that the hardware should do it for you).
My understanding is that he/she wanted to have the hardware handle concurrency automatically. Unfortunately, I don't see any way to get around having software dealing with concurrency.
I may be misunderstanding your question, but it seems as though your "server" could be written as a one-request server and then you'd just start another one to listen when a request came in. If starting a new process is too slow, use a pool. Some versions of Apache use this method. It's not a panacea, but it does provide a way you could use to avoid threaded code entirely.
OK, but surely there are cases where it would be useful (I would argue necessary) to have a single process with multiple threads - when dealing with GUIs for example.
Instead of dealing with all this complexity, I don't understand why a simpler approach is not used, like having a single interpreter per thread and a very good message passing strategy between interpreters.
You could already achieve that with OS processes and IPC. The whole point of having multi-threading is to be able to write compact, shared-memory code with minimal use of synchronization operators, and sharing as much code and data as possible.
One interpreter per thread means all side effects have to be migrated to the other threads to keep a consistent view of memory. Guess what you need to do that? Yep, a global lock (except this time it's across all interpreters, instead of just one).
If you restricted shared memory to objects explicitly declared as shared, you wouldn't need a GIL. You'd simply need per-object locks.
For scientific computing purposes, you can often accomplish this with multiple processes and a numpy array allocated by shmget/shmat. But I'm not sure how to share complex objects in this way.
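If you don't want to drop down to shmget/shmat yourself, the stdlib gives you roughly the same thing. A sketch with multiprocessing.Array wrapped in a numpy view (the workers write to disjoint slices, so the wrapper's lock isn't even needed here):

    import multiprocessing as mp
    import numpy as np

    def fill(shared, start, stop):
        arr = np.frombuffer(shared.get_obj(), dtype=np.float64)  # same memory, no copy
        arr[start:stop] = np.arange(start, stop)

    if __name__ == "__main__":
        shared = mp.Array("d", 1000000)        # 1M doubles in shared memory
        procs = [mp.Process(target=fill, args=(shared, i * 500000, (i + 1) * 500000))
                 for i in range(2)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        result = np.frombuffer(shared.get_obj(), dtype=np.float64)
        print(result[:3], result[-3:])         # filled in by the child processes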
I'm not quite sure if I'm right here (and I'd appreciate it if another HN reader corrected me).
But I think that's how the Queue object in Python 2.6 works. The Queue instance is locked, but you seem to be free to do whatever you like within the threads that are consuming the queue.
The reason I'm not sure is that having a single object being locked seems to contradict the GIL concept...
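As far as I understand it, they don't contradict each other: the GIL serializes bytecode execution inside the interpreter, while the Queue carries its own lock so that multiple threads can safely put/get without you writing any locking code. A quick sketch of the usual producer/consumer pattern:

    import threading, queue      # the module is named Queue in Python 2

    q = queue.Queue()            # has its own internal lock

    def consumer():
        while True:
            item = q.get()       # blocks until an item is available
            if item is None:     # sentinel value: time to stop
                break
            print("handled", item)

    workers = [threading.Thread(target=consumer) for _ in range(3)]
    for w in workers:
        w.start()
    for i in range(10):
        q.put(i)                 # no explicit locking needed on our side
    for _ in workers:
        q.put(None)
    for w in workers:
        w.join()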
The whole point of the "Smart message passing" strategy is to move objects from one thread to another one very fast (that is, zero-copy), and with very local locking requirements.
I like perl's ithread setup a lot. You explicitly need to mark data as shared between threads, otherwise all variables/objects are local to the thread. Things like queues, for example, are implemented as shared @arrays with a locking primitive around the manipulations, mostly hidden behind the Queue API (Queue->enqueue and Queue->dequeue).
The interpreter code is still shared among all the threads, but each has thread local storage separate from the shared arena. I've found that explicitly needing to mark data as shared to be a big boon to development, I think it has helped reduce the number of the bugs related to shared state.
Last time I used ithreads they made Perl crash and burn. This was on a program with about 30k lines of code. On top of that, creating a single thread took 15 seconds because Perl spends a ridiculous amount of time copying the entire state to the new thread. Eventually I used fork() and multiprocessing because that's much faster than using ithreads. ithreads are pretty useless.
Then you must have used them a really long time ago. The last time I did, I had no trouble decoding audio streams in one thread, pushing the raw audio samples into a Queue, pulling them out of the Queue in another thread and sending them to the audio device, and doing network IO in yet another thread. This code is at http://github.com/thwarted/thundaural (no longer maintained) and was written in 2004 and 2005 against perl 5.6 and perl 5.8. It's hardly a complex or large threaded application, nor an example of fantastic code, but I never experienced "crash and burn" or prohibitively long thread start-up time.
I can't produce a significantly long start-up time when perl copies a 240MB+ local state using http://gist.github.com/258016. Much larger, and the overhead pushes my 4GB machine into swap (so perl's not very efficient with its data overhead, it seems). It is slower than if the structure is shared between the threads (you can uncomment a line to compare), which shows that perl is copying the state to the other interpreters. But that just means you need to use the same model with perl threads that you'd use when you do multiprocessing with fork and long-lived child processes that don't exec: spawn threads early, before your heap balloons, to avoid all that spawn-time copying (which is good practice anyway, even when COW is used for fork). Every model has its trade-offs, and it's good to know those trade-offs going in, rather than treating the flavor of the day as a silver bullet.
And even if perl's implementation isn't stable/good, it doesn't mean the model of explicit sharing is a bad one. And it most definitely doesn't mean that using a separate interpreter in each thread with global sharing is bad at all.
Tcl has reference counting and copy on write to minimize copies when passing objects to functions. Not sure if this is the case for slave interpreters (Tcl's term), but I would be very surprised if it were not.
You're correct. My position was that in Linux, threads weigh just as much as processes; the forking speed is identical, and extremely fast thanks to COW, compared to other OSes where processes and threads are completely different beasts.
Even with NPTL, both processes and threads have task_structs.
I'm no expert in the GIL, but pretty much every widely adopted Garbage Collection algorithm requires a "stop-the-world" phase where object references can't be changed. Every VM has some concept of "stop points" where all user code is suspended, but Python's GIL is much more wide-ranging than that found in say the JVM or .NET.
I'm no GC expert, but these get you good, fast, and mostly concurrent GC (in very high-pressure situations where one is running out of memory, I think both fall back to a serialized stop-the-world last-ditch effort).
Luckily, as it happens, I am an expert in JVM Garbage Collectors :-). All of Sun Java's garbage collectors have a stop-the-world phase; the concurrent ones minimize the length of that stop-the-world period, but they still have one.
If you read the section on the concurrent collector on the link you posted you can see it says "During each major collection cycle, the concurrent collector will pause all the application threads for a brief period at the beginning of the collection and again toward the middle of the collection. The second pause tends to be the longer of the two pauses and multiple threads are used to do the collection work during that pause"
The reference counting causes the problem here. When multithreading means that incrementing/decrementing a reference count is no longer atomic, you need locks, or all hell breaks loose. Adding a mark&sweep GC isn't going to fix that.
While I'm not familiar with the specifics of Python's GC, a mark&sweep phase is usually added to reference counting so that if there's garbage which contains references to itself but has no external references, it will eventually be collected. (_Garbage Collection_ by Richard Jones and Rafael Lins is an excellent resource on GC details, btw. There's also a decent overview in the O'Reilly OCaml book (http://caml.inria.fr/pub/docs/oreilly-book/html/book-ora082.... )). In other words, it plugs the worst memory leaks caused by reference counting.
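The classic cycle case in a few lines, if anyone wants to see it (Python's gc module is the cycle detector layered on top of the refcounting):

    import gc

    class Node(object):
        pass

    a, b = Node(), Node()
    a.partner, b.partner = b, a    # each object keeps the other's refcount at 1
    del a, b                       # pure refcounting would leak both objects here
    print(gc.collect())            # the cycle detector finds and frees them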
How to do multiprocessor / multithread GC well is still an area of active research. In the mean time, one simpler solution is to have several independent VM states, each running in their own thread (or process), and communicating via message passing. Lua makes this easy, but its VM is considerably lighter than Python's.
The mark and sweep collector is a little more complicated than that. It doesn't use a single "seen" bit the way a normal GC header would; instead it uses the refcount itself in very, very clever ways.
A GC looks at the content of memory and checks if each piece of memory is accessible (there are sophisticated algorithms to do this) - no ref count is necessary with GC.
Here's a (very naive) algorithm that doesn't count references:
1. Start at some well-known root object. Mark this object reachable.
2. For each object that is reachable
2a. For each object that this reachable object has a reference to
2b. Mark this object reachable if it's not already.
3. If any object was marked reachable in 2b, repeat step 2.
4. Garbage collect every object not marked reachable.
To see an important difference between this kind of strategy ("mark and sweep") and reference counting, consider the classic example of two objects, each one of which has a link to the other, but are not visible to anything else in the system.
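Here's that naive algorithm written out over a toy object graph, including the two-objects-pointing-at-each-other case (a sketch, obviously; real collectors walk the VM's heap, not Python lists):

    def mark_and_sweep(root, heap):
        reachable = set()
        pending = [root]                  # step 1: the well-known root
        while pending:                    # steps 2-3: keep following references
            obj = pending.pop()
            if obj in reachable:
                continue
            reachable.add(obj)
            pending.extend(obj.refs)
        return [o for o in heap if o not in reachable]   # step 4: the garbage

    class Obj(object):
        def __init__(self, name):
            self.name = name
            self.refs = []
        def __repr__(self):
            return self.name

    root, a, c, d = Obj("root"), Obj("a"), Obj("c"), Obj("d")
    root.refs.append(a)
    c.refs.append(d)
    d.refs.append(c)                      # the classic unreachable cycle
    print(mark_and_sweep(root, [root, a, c, d]))   # -> [c, d]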
There are many ways to exploit multiple cores; multi-threading is just one of them, and many other techniques exist. Also, one thing to realize is that if speed really matters (as in scientific apps), you will get a much bigger speedup by rewriting some parts in C than by using all the cores from Python (at least with only a couple of cores).
Finally, a point which is not often brought up but is crucial in my opinion: the GIL makes C extensions much easier to write. That's one big reason for Python's success in the first place.
Complex in-memory data structures are the main use case that is not well supported by multi-process architectures. There's a trend toward in-memory databases, with disk used for archival only, so this issue is becoming critical.
Processes don't share pointers, they only share BLOBs. Threads do share pointers, and that's why they are more suitable for in-memory data analysis/manipulation.
Yes, but if I'm the one writing the in-memory database/analytics stuff, I might want to use a Python/C combination to do that. Right now that's not an option because of the GIL.
With native threads, all your threads have their own identity and they're known to the OS task scheduler. So they can all block or run independently. But with green threads, the OS doesn't know about your "threads"; when the parent process is blocked, so are all the threads.
One of the basic assumptions behind green threads is that you aren't going to make blocking OS calls from your green threads - the interpreter should intercept those, run them in a different thread (or ideally, run with async I/O) so that it can schedule a different thread until the call returns.
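You can see the native-thread behaviour directly in CPython, since its threads are OS threads and the GIL is released around blocking calls (a sketch; sleep stands in for a blocking OS call):

    import threading, time

    def blocker(name):
        time.sleep(2)                     # stand-in for a blocking OS call
        print(name, "finished")

    start = time.time()
    threads = [threading.Thread(target=blocker, args=("thread-%d" % i,))
               for i in range(3)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("elapsed: %.1fs" % (time.time() - start))   # ~2s: the blocked threads
                                                      # didn't stop each other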
Note the GP said multiprocessing (that's a Python library for multiple processes with a thread-like interface, not green threads). As for why some see it as not enough: it's a bit like CGI vs FCGI; spinning up a whole new process and VM is not a cheap operation compared to starting a thread. Sure, you can work around it with design changes, pools and whatnot, but some would rather see the problem fixed correctly rather than worked around.
In the long run, it seems like approaches like ConCRT and User Mode Scheduling (Windows only, sadly) could address this. The OS can provide support for detecting any blocking operations (even page faults, for example) and doing task-level switching for you.
Note that Python doesn't use green threads. It uses real, native threads. The thing is that the interpreter will prevent any other threads from accessing it until the GIL is released.
The problem with the GIL is nearly the opposite. The lock is dropped for system calls but kept when Python code is running. That's not a problem for IO bound workloads but does not scale well over multiple CPUs for CPU bound tasks that do most of the work in Python-land.
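It's easy to see the CPU-bound case for yourself (a sketch; numbers will vary, but on stock CPython the threaded run won't be ~2x faster):

    import threading, time

    def burn():
        total = 0
        for i in range(5000000):      # pure Python work, so the GIL stays held
            total += i * i

    start = time.time()
    burn(); burn()
    print("sequential:  %.1fs" % (time.time() - start))

    start = time.time()
    t1, t2 = threading.Thread(target=burn), threading.Thread(target=burn)
    t1.start(); t2.start()
    t1.join(); t2.join()
    print("two threads: %.1fs" % (time.time() - start))   # about the same, or worse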
I'm not sure dropping the GIL is the best solution. Something like Erlang-style message passing sounds like a better model. I'm not sure that's a good fit either, though, since it works best with functional-style programming.
No, I remember reading those comments before this revision.
I was under the impression that these Google Code wiki pages were kept under source control, so you should be able to view histories etc., but I can't see any obvious link.
Found it, the change to this section was done 33 hours ago, a diff can be found here:
At least they want to get rid of the horrible abomination known as reference counting. It's the primary reason why I've moved on to other languages for the sake of creating games.