I don't want to cast doubt on the professional qualities of Google engineers, but could someone explain to me why it's so complicated to remove the GIL in Unladen Swallow? As far as I remember, UnSw targets LLVM, which is an advanced VM with JIT compilation (i.e. not fully interpreted). Jython does nearly the same (I'm not a Jython developer, so I may not know the details), but it targets the JVM, and as a result Jython has no GIL. What's the key difference between LLVM and the JVM in this regard?
One other potential difficulty is that Jython doesn't aim to be 100% compatible with existing C extensions. Unladen Swallow does. Plus, I believe Unladen Swallow is essentially a branch of the CPython source code, while Jython was written from scratch.
Well, the JVM is a high quality, mature, very fast, cross platform runtime. However, I'm always slightly concerned that letting the JVM in through the front door might open the back door to the kind of SOP (Soviet Oriented Programming) exemplified by Java EE. JSR hell is lurking behind the most innocent-looking serverside library :-)
Java suffers from JSR approval hell simply because it tries to standardize everything. That's a requirement for doing well in the enterprise world, along with the programmer certification mechanism and the inclusion of the top industry players in the standards approval process.
If your language is just a hackers' toy (Clojure and Scala are exactly that), then why do you care about JSRs in the first place? The JSRs usually don't mention the JVM, only the libraries or Java language extensions.
I agree with you that it should be possible to avoid much of the complexity of Java EE if you just use the JVM.
But one motivation for using a JVM based language in the first place is that so many libraries are available on that platform. Unfortunately, many of them suffer from excessive complexity and some of that complexity is due to the JCP.
The Ruby community has its own traits. A "gem for Java" probably doesn't exist because not many people need it (otherwise it would have been written already). In pure Java it's hard to use a REPL, and you have to build your project before running it anyway, so Maven is a better fit there.
Removing the GIL is easy if you drop the compatibility requirements with C extensions. A lot of complex C extensions rely on CPython's exact reference-counting semantics. That's why Unladen Swallow is interesting in the first place.
How about teaching programmers to program without using threads?
edit: sure downmod me. It's crazy talk! How could programmers do without threads and concurrency issues and all of the other blocking problems. Hardware should handle multiple cores. Not programmers.
There is probably more traction in what you are saying than some of the commenters suggest. For compute-bound tasks, it is often productive to split computations into long-running processes.
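To make that concrete, here's a minimal sketch with the stdlib multiprocessing module (the work function and chunk sizes are made up purely for illustration):

    from multiprocessing import Pool

    def crunch(chunk):
        # stand-in for some CPU-heavy computation on one slice of the input
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        chunks = [range(i * 1000000, (i + 1) * 1000000) for i in range(8)]
        pool = Pool()                      # one worker process per core by default
        totals = pool.map(crunch, chunks)  # each chunk runs in its own process
        pool.close()
        pool.join()
        print(sum(totals))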
There are two reasons, I think, that the discussion goes beyond what you suggest. One is detailed in http://www.tbray.org/ongoing/When/200x/2009/09/27/Concur-dot... where the Wide Finder project explores relatively easy ways to split a log-searching task into effective threads on a multi-core machine.
This is really a hard problem, as evidenced by Tim's long series of articles detailing various forays into Clojure and other languages.
Your comment "Hardware should handle multiple cores" reflects the opposite of what I think chip manufacturers are thinking. They run into the performance barrier, so they build a chip with more CPUs on it and hand the problem off to the compiler team and the rest of the software world.
I would take it another step further in challenging hardware manufacturers to look at the broader problem. There was an article recently that noticed that for Lisp, the effective performance gain over a decade or two was a factor of about 50, whereas for C-family programs it was several orders of magnitude. To me this implies that hardware isn't going in a direction that supports higher-level computing.
Remember when the 360 instruction set came out? The 7094 people looked at it with some sense of disappointment. And where are the nice instruction sets exemplified by the PDP-10 and its family?
Perhaps this implies smarter cores, so that we don't need so many of them.
But in today's world, it seems that the languages that work well with multiple threads have a language construct that is required to make it work--libraries don't do the trick. The clean channels of Go and the concurrency constructs in Clojure point the way. Maybe the GIL-fix approach is truly doomed.
I think you're being downvoted because people disagree with what you're saying. In my experience you need multiple threads when you want multiple things to happen at the same time. E.g. if I have a client/server architecture and one client instructs the server to perform a long-running task, I don't want the server to appear frozen to all my other clients, which it would if the server ran in a single thread. I don't really see how you can get around this. Do you have a solution?
>> "you need multiple threads when you want multiple things to happen at the same time"
Computers don't work that way. Unless you have many CPUs, nothing happens at the same time.
Rewrite your 'long running task' to do things bit by bit.
By effectively doing your own timeslicing, you remove the need for any locking or concurrency issues. Once you get into the habit of programming like this, you wouldn't believe how much easier things are.
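A minimal sketch of what I mean, in Python (the "other work" hook is imaginary; generators are just one convenient way to slice a task up):

    def long_running_task(n):
        total = 0
        for i in range(n):
            total += i * i
            if i % 10000 == 0:
                yield                  # hand control back to the main loop
        print("task done:", total)

    def do_other_work():
        pass                           # stand-in for serving the other clients

    tasks = [long_running_task(1000000)]
    while tasks:
        for task in list(tasks):
            try:
                next(task)             # run one slice of this task
            except StopIteration:
                tasks.remove(task)
        do_other_work()                # nothing was blocked while the slice ran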
FWIW this is how Mibbit backend works - thousands of connections handled in a single thread.
JavaScript doesn't have threads (thankfully). There is no need for threads. They look like magic, but they cause more issues than they solve IMHO. The mapping of 'work' onto physical CPUs should be done silently by the hardware, IMHO (if you have more than one CPU).
Windows 3 and MacOS classic apps used to get wedged every so often because cooperative multitasking is so easy to get slightly wrong.
Writing everything in continuation-passing style is like writing your own filesystem or parser state machine. It's a little more efficient per core, it's complicated enough that the challenge of getting it right might be gratifying, but it's rarely the best use of our time when we're surrounded with ridiculously powerful hardware. Even netbooks have multiple cores now.
> Windows 3 and MacOS classic apps used to get wedged every so often because cooperative multitasking is so easy to get slightly wrong
This is a different issue: multitasking of processes on the same CPU. This is an area that needs to be handled by the OS, because you can't guarantee that other processes will be cooperative.
Inside your own process, however, you can do whatever you want by dividing tasks into small chunks, without the need for threads.
The only place where I think we need threads is when dealing with libraries that we can't control. For example, UI libraries, networking libraries, math libraries, etc.
I think his point is that when you "divide tasks in small chunks", you are no longer writing code in the way that is most convenient for you but in a way that gets better performance from a simple compiler. It means more code spread over more functions, which increases the probability of bugs.
I don't think it's a problem that needs to be solved for most applications. Certainly for average websites/webapps, they're just passing data around. CPU usage shouldn't be high at all unless something really computationally expensive needs to go on like speech recognition or video encoding or something.
I'll let the scientists who actually need CPU power figure that one out.
Most of the websites people here are working on could be run on a 386 and still have cycles to spare.
I think the idea that every piece of software will run in parallel in the future is nonsense. Hardware vendors are just trying to create a need where there is none.
Clearly, someone will find applications where this power is needed (graphics, simulation, robotics), but there is no way that MS Word will run in parallel in more than a few processors. The biggest change in multicore is in enabling new applications, not in changing the way current applications are developed.
I did not say every piece of software will run in parallel, or that it should. The subject is the runtime environment of a programming language. If Python is going to be used in these new applications you brought up, it would help if they were able to remove the GIL.
I've heard that argument since we started having dual CPUs 5-10 years ago. I don't really buy it. Netbooks are so popular mainly because we don't need so much CPU power on our local thin clients to the web.
This kind of argument sounds so familiar. Decades ago, when people worked on supercomputers, hundreds of millions of dollars were thrown at "parallel compilers" and "shared-memory machines" which aimed to reduce the complexity of parallel computing for programmers. It just didn't work. If a programmer isn't aware of the underlying architecture of the parallel machine, performance suffers heavily. That's why we have threads, message passing, and NUMA today.
For one, languages like Erlang, Haskell and Clojure already make concurrent programming reasonably approachable, and second, if people were to switch to proper dataflow languages* then parallelization would be implicit and done automagically for the programmer.
* What I really want is a proper dataflow/imperative hybrid that lets me choose the right tool for the job...
I appreciate that nothing happens at the same time on a single core but CPU time is shared between threads so it 'appears' as if more than one thing happens at once. A good example of this is a web browser on a single core machine - the browser does not freeze up while it is downloading data. That is because the CPU time is shared between the UI thread and the other worker threads.
An alternative (better) model would be to simply have a single thread with a main loop: async networking, plus periodic UI updates, all in the same thread.
    while (true) {
        networking.check();  // Check if any sockets are ready for read/write/connect
        ui.update();         // Update the UI a bit if needed
    }
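Roughly the same shape in Python, using the stdlib select module for the async networking (just a sketch; update_ui() is a stand-in for whatever periodic work you need, and a real loop would also buffer writes):

    import select, socket

    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("", 8000))
    server.listen(64)
    server.setblocking(False)
    sockets = [server]

    def update_ui():
        pass                                   # stand-in for periodic UI work

    while True:
        readable, _, _ = select.select(sockets, [], [], 0.05)   # 50 ms tick
        for sock in readable:
            if sock is server:
                conn, _ = sock.accept()        # new client connection
                conn.setblocking(False)
                sockets.append(conn)
            else:
                data = sock.recv(4096)
                if data:
                    sock.send(data)            # echo back (unbuffered, sketch only)
                else:
                    sockets.remove(sock)       # client went away
                    sock.close()
        update_ui()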
The only case this would be a terrible idea is if you don't have control of all the code, or need to interface to things that may block/crash/etc.
No modern web browser controls all of the code, since it must execute arbitrary JavaScript - which can block and crash.
Everyone doing something doesn't make them right. But when all major instances of an application are implemented differently than you think is best, perhaps you don't understand the problem as well as you think you do. Chrome, I think, is the best browser architecture, and it looks like IE and Firefox will adopt something similar. I think they use separate processes to manage tabs instead of threads, but it's still parallel.
Execute a single JavaScript instruction at a time? I doubt the performance of that would be acceptable. But if you're aware of any browsers doing that, I'd like to know.
If I were to write a browser right now, it's how I'd do it. You would more likely execute a few js instructions per loop, depending on what else you have to do in that loop also - network check, ui update, etc.
Why would there be a performance hit in doing js instructions one by one though ;) A loop isn't expensive.
A loop isn't expensive, but swapping your interpreter's code out of registers and CPU cache for each JS instruction may be. Anyway, this method doesn't really help you JIT-compile your JS either...
Code is code. If you can explain that a bit more I'd be interested.
Obviously you wouldn't update the UI every time you execute a js instruction. That would be insane. I just put 1 js instruction in the loop to have the minimum unit, in case anything else needs to be updated very quickly at the same time - eg some animation etc
CPUs (or cores in a CPU) have a pretty small instruction cache. On a Core2, I believe it's 32K cache per core. It's somewhat expensive (slow) to fetch instructions from RAM to L1, but once code is in L1, reading from it is immediate, or close to it. If you have a two-core machine where one core can do most of your js dispatch logic, and the other core can do the inner loop of your rendering logic, then you can have much better performance than constantly swapping out your logic on the cache of a single core.
I'm really not sure how useful it is to cache-optimise a browser. You need to be able to fit a significant amount of logic into 32K in order to take advantage of cache. When I used to do realtime image processing (AR), I could get an order of magnitude speedup by just getting my logic to fit into 32K; image processing code can literally go from 3fps to 30fps just by tightening code to the point where it can fit in L1 cache. I don't know if it's possible to fit significant amounts of rendering logic or js logic into 32K, but if it is, then dedicating a core to each of those functions could give a significant speedup.
The JavaScript VM will have a significant amount of state associated with it. Executing a virtual instruction will require accessing that state. If that data is not in the CPU's cache, it will cause cache misses, which stall code progression.
If you then use that data in the cache for a while, then the cost of the cache miss will be amortized. But what you're proposing is going back and forth quickly between the JavaScript VM and the rest of the browser code. The browser code will also need to bring its data into the cache, which will kick out the JavaScript VM's data.
Since you're proposing that the JavaScript VM should do a very small amount of work at each time, and it will likely need to bring all of its data back into the cache each time, you will see a lot of CPU stalls.
I can't reply directly to your comment, so I'll answer here. Pretty much everything I do at work involves communicating with resources which may block or crash, so it wouldn't really be practical to put everything in a single while loop. It sounds a little bit to me like you're reinventing the wheel: multi-threading means you don't have to go to the pains of breaking your long-running tasks up into pieces; the infrastructure takes care of that for you. I'm not trying to convert you; it sounds like you're perfectly happy and successful working in a single-threaded environment. It just seems like you're doing quite a lot of work to avoid multiple threads, which really aren't that difficult to manage.
One of the big pains of my work day is our accounting system which, while being great at what it does well, can be abysmally slow to respond to queries (the simple question of how much of X do we own now, takes it around three minutes to answer...). We know from experience though that internally it can deal with up to three requests at a time without it slowing down. If I only had one thread then three requests (which block) would take me nine minutes to process, with three threads I can get all the results back in three minutes.
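For what it's worth, the threaded version of that is only a few lines (a sketch; the slow accounting call is simulated with sleep):

    import threading, time

    def slow_query(symbol, results):
        time.sleep(3)                    # stand-in for the ~3 minute accounting query
        results[symbol] = "position for " + symbol

    results = {}
    threads = [threading.Thread(target=slow_query, args=(s, results))
               for s in ("X", "Y", "Z")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                         # total wall time ~= one query, not three
    print(results)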
Unfortunately that's not practical, it's third party software and at least the last time we checked there wasn't a better alternative with the same functionality.
Not always true. There are multiple levels of parallelism in modern CPUs, and cores are only one of them, e.g. instruction-level parallelism, hyperthreading, etc.
Anyway, there's more to concurrent programming than doing multiple things at once in parallel - there's also the use of threads to turn a synchronous blocking call into an asynchronous one, like you said.
> By effectively doing your own timeslicing, you remove the need for any locking or concurrency issues
Native threads do this timeslicing for you; that is why they were invented (so that programmers didn't have to mess around with crazy micro-management of what they were doing).
Can you imagine writing Chrome or Firefox using a single thread (it would be hard, and probably unusable)? :)
> Native threads do this timeslicing for you; that is why they were invented
But they don't solve the fundamental problem of how to organize your code to take advantage of this. It is really easy to create a threaded application that has lower performance than a single-threaded application that divides work into chunks. And, after all the research in this area, there is no clear way to use threads without spending a lot of time making sure they work.
At least on Windows they are both multi-threaded. If you look in Task Manager then firefox.exe will have multiple threads (32 on my machine) and each chrome.exe process will have multiple threads (anywhere between 3 and 21 on the 15 tabs I have open).
> FWIW this is how Mibbit backend works - thousands of connections handled in a single thread.
You might as well not be running an OS and have Mibbit run straight on the hardware. All it's doing is handling network connections and some other data processing right? And don't look at me like I'm crazy, your OS is using valuable CPU time!
The problem is that you're also preventing yourself from being able to take advantage of multiple cores. Of course, it would be nice if hardware would divide those tasks up between processors automatically. But then they'd be threads.
It's not so much that I disagree with him/her. In fact, I would love it if we could do away with memory sharing. It's just that it seems a bit utopian (especially the argument that the hardware should do it for you).
My understanding is that he/she wanted to have the hardware handle concurrency automatically. Unfortunately, I don't see any way to get around having software dealing with concurrency.
I may be misunderstanding your question, but it seems as though your "server" could be written as a one-request server and then you'd just start another one to listen when a request came in. If starting a new process is too slow, use a pool. Some versions of Apache use this method. It's not a panacea, but it does provide a way you could use to avoid threaded code entirely.
OK, but surely there are cases where it would be useful (I would argue necessary) to have a single process with multiple threads - when dealing with GUIs for example.
Instead of dealing with all this complexity, I don't understand why a simpler approach is not used, like having a single interpreter per thread and a very good message passing strategy between interpreters.
You could already achieve that with OS processes and IPC. The whole point of having multi-threading is to be able to write compact, shared-memory code with minimal use of synchronization operators, and sharing as much code and data as possible.
One interpreter per thread means all side effects have to be migrated to the other threads to keep a consistent view of memory. Guess what you need to do that? Yep, a global lock (except this time it's across all interpreters, instead of just one).
If you restricted shared memory to objects explicitly declared as shared, you wouldn't need a GIL. You'd simply need per-object locks.
For scientific computing purposes, you can often accomplish this with multiple processes and a numpy array allocated by shmget/shmat. But I'm not sure how to share complex objects in this way.
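If you don't want to drop down to shmget/shmat yourself, the stdlib gives you roughly the same thing. A sketch with multiprocessing.Array wrapped in a numpy view (the workers write to disjoint slices, so the wrapper's lock isn't even needed here):

    import multiprocessing as mp
    import numpy as np

    def fill(shared, start, stop):
        arr = np.frombuffer(shared.get_obj(), dtype=np.float64)  # same memory, no copy
        arr[start:stop] = np.arange(start, stop)

    if __name__ == "__main__":
        shared = mp.Array("d", 1000000)        # 1M doubles in shared memory
        procs = [mp.Process(target=fill, args=(shared, i * 500000, (i + 1) * 500000))
                 for i in range(2)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        result = np.frombuffer(shared.get_obj(), dtype=np.float64)
        print(result[:3], result[-3:])         # filled in by the child processes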
I'm not quite sure if I'm right here (and I'd appreciate it if another HN reader corrected me).
But I think that's how the Queue object in Python 2.6 works. The Queue instance is locked, but you seem to be free to do whatever you like within the threads that are consuming the queue.
The reason I'm not sure is that having a single object being locked seems to contradict the GIL concept...
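As far as I understand it, they don't contradict each other: the GIL serializes bytecode execution inside the interpreter, while the Queue carries its own lock so that multiple threads can safely put/get without you writing any locking code. A quick sketch of the usual producer/consumer pattern:

    import threading, queue      # the module is named Queue in Python 2

    q = queue.Queue()            # has its own internal lock

    def consumer():
        while True:
            item = q.get()       # blocks until an item is available
            if item is None:     # sentinel value: time to stop
                break
            print("handled", item)

    workers = [threading.Thread(target=consumer) for _ in range(3)]
    for w in workers:
        w.start()
    for i in range(10):
        q.put(i)                 # no explicit locking needed on our side
    for _ in workers:
        q.put(None)
    for w in workers:
        w.join()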
The whole point of the "Smart message passing" strategy is to move objects from one thread to another one very fast (that is, zero-copy), and with very local locking requirements.
I like perl's ithread setup a lot. You explicitly need to mark data as shared between threads, otherwise all variables/objects are local to the thread. Things like queues, for example, are implemented as shared @arrays with a locking primitive around the manipulations, mostly hidden behind the Queue API (Queue->enqueue and Queue->dequeue).
The interpreter code is still shared among all the threads, but each has thread local storage separate from the shared arena. I've found that explicitly needing to mark data as shared to be a big boon to development, I think it has helped reduce the number of the bugs related to shared state.
Last time I used ithreads they made Perl crash and burn. This was on a program with about 30k lines of code. On top of that, creating a single thread took 15 seconds because Perl spends a ridiculous amount of time copying the entire state to the new thread. Eventually I used fork() and multiprocessing because that's much faster than using ithreads. ithreads are pretty useless.
Then you must have used them a really long time ago. The last time I did, I had no trouble decoding audio streams in one thread, pushing the raw audio samples into a Queue, pulling them out of the Queue in another thread and sending them to the audio device, and doing network IO in yet another thread. This code is at http://github.com/thwarted/thundaural (no longer maintained) and was written in 2004 and 2005 against perl 5.6 and perl 5.8. It's hardly a complex or large threaded application, nor an example of fantastic code, but I never experienced "crash and burn" or prohibitively long thread start-up time.
I can't produce a significantly long start-up time when perl copies a 240MB+ local state using http://gist.github.com/258016. Much larger, and the overhead pushes my 4GB machine into swap (so perl's not very efficient with its data overhead, it seems). It is slower than if the structure is shared between the threads (you can uncomment a line to compare), which shows that perl is copying the state to the other interpreters. But that just means you need to use the same model with perl threads that you'd use when you do multiprocessing with fork and long-lived child processes that don't exec: spawn threads early, before your heap balloons, to avoid all that spawn-time copying (which is good practice anyway, even when COW is used for fork). Every model has its trade-offs, and it's good to know those trade-offs going in, rather than treating the flavor of the day as a silver bullet.
And even if perl's implementation isn't stable/good, it doesn't mean the model of explicit sharing is a bad one. And it most definitely doesn't mean that using a separate interpreter in each thread with global sharing is bad at all.
Tcl has reference counting and copy on write to minimize copies when passing objects to functions. Not sure if this is the case for slave interpreters (Tcl's term), but I would be very surprised if it were not.
You're correct. My position was that in Linux, threads weigh just as much as processes; the forking speed is identical, and extremely fast thanks to COW, compared to other OSes where processes and threads are completely different beasts.
Even with NPTL, both processes and threads have task_structs.
I'm no expert in the GIL, but pretty much every widely adopted Garbage Collection algorithm requires a "stop-the-world" phase where object references can't be changed. Every VM has some concept of "stop points" where all user code is suspended, but Python's GIL is much more wide-ranging than that found in say the JVM or .NET.
I'm no GC expert, but these get you good, fast, and mostly concurrent GC (in very high-pressure situations where one is running out of memory, I think both fall back to a serialized stop-the-world last-ditch effort).
Luckily, as it happens, I am an expert in JVM Garbage Collectors :-). All of Sun Java's garbage collectors have a stop-the-world phase; the concurrent ones minimize the length of that stop-the-world period, but they still have one.
If you read the section on the concurrent collector on the link you posted you can see it says "During each major collection cycle, the concurrent collector will pause all the application threads for a brief period at the beginning of the collection and again toward the middle of the collection. The second pause tends to be the longer of the two pauses and multiple threads are used to do the collection work during that pause"
The reference counting causes the problem here. When multithreading means that incrementing/decrementing a reference count is no longer atomic, you need locks, or all hell breaks loose. Adding a mark&sweep GC isn't going to fix that.
While I'm not familiar with the specifics of Python's GC, a mark&sweep phase is usually added to reference counting so that if there's garbage which contains references to itself but has no external references, it will eventually be collected. (_Garbage Collection_ by Richard Jones and Rafael Lins is an excellent resource on GC details, btw. There's also a decent overview in the O'Reilly OCaml book (http://caml.inria.fr/pub/docs/oreilly-book/html/book-ora082.... )). In other words, it plugs the worst memory leaks caused by reference counting.
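The classic cycle case in a few lines, if anyone wants to see it (Python's gc module is the cycle detector layered on top of the refcounting):

    import gc

    class Node(object):
        pass

    a, b = Node(), Node()
    a.partner, b.partner = b, a    # each object keeps the other's refcount at 1
    del a, b                       # pure refcounting would leak both objects here
    print(gc.collect())            # the cycle detector finds and frees them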
How to do multiprocessor / multithread GC well is still an area of active research. In the mean time, one simpler solution is to have several independent VM states, each running in their own thread (or process), and communicating via message passing. Lua makes this easy, but its VM is considerably lighter than Python's.
The mark and sweep collector is a little more complicated than that. It doesn't use a single "seen" bit the way a normal GC header would; instead it uses the refcount itself in very, very clever ways.
A GC looks at the content of memory and checks if each piece of memory is accessible (there are sophisticated algorithms to do this) - no ref count is necessary with GC.
Here's a (very naive) algorithm that doesn't count references:
1. Start at some well-known root object. Mark this object reachable.
2. For each object that is reachable
2a. For each object that this reachable object has a reference to
2b. Mark this object reachable if it's not already.
3. If any object was marked reachable in 2b, repeat step 2.
4. Garbage collect every object not marked reachable.
To see an important difference between this kind of strategy ("mark and sweep") and reference counting, consider the classic example of two objects, each one of which has a link to the other, but are not visible to anything else in the system.
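Here's that naive algorithm written out over a toy object graph, including the two-objects-pointing-at-each-other case (a sketch, obviously; real collectors walk the VM's heap, not Python lists):

    def mark_and_sweep(root, heap):
        reachable = set()
        pending = [root]                  # step 1: the well-known root
        while pending:                    # steps 2-3: keep following references
            obj = pending.pop()
            if obj in reachable:
                continue
            reachable.add(obj)
            pending.extend(obj.refs)
        return [o for o in heap if o not in reachable]   # step 4: the garbage

    class Obj(object):
        def __init__(self, name):
            self.name = name
            self.refs = []
        def __repr__(self):
            return self.name

    root, a, c, d = Obj("root"), Obj("a"), Obj("c"), Obj("d")
    root.refs.append(a)
    c.refs.append(d)
    d.refs.append(c)                      # the classic unreachable cycle
    print(mark_and_sweep(root, [root, a, c, d]))   # -> [c, d]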
There are many ways to exploit multiple cores; multi-threading is just one of them, and many other techniques exist. Also, one thing to realize is that if speed really matters (as in scientific apps), you will get a much bigger speedup by rewriting some parts in C than by using all the cores from Python (at least with only a couple of cores).
Finally, a point which is not often brought up but is crucial in my opinion: the GIL makes C extensions much easier to write. That's one big reason for Python's success in the first place.
Complex in-memory data structures are the main use case that is not well supported by multi-process architectures. There's a trend toward in-memory databases, with disk used for archival only, so this issue is becoming critical.
Processes don't share pointers, they only share BLOBs. Threads do share pointers, and that's why they are more suitable for in-memory data analysis/manipulation.
Yes, but if I'm the one writing the in-memory database/analytics stuff, I might want to use a Python/C combination to do that. Right now that's not an option because of the GIL.
With native threads, all your threads have their own identity and they're known to the OS task scheduler. So they can all block or run independently. But with green threads, the OS doesn't know about your "threads"; when the parent process is blocked, so are all the threads.
One of the basic assumptions behind green threads is that you aren't going to make blocking OS calls from your green threads - the interpreter should intercept those, run them in a different thread (or ideally, run with async I/O) so that it can schedule a different thread until the call returns.
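You can see the native-thread behaviour directly in CPython, since its threads are OS threads and the GIL is released around blocking calls (a sketch; sleep stands in for a blocking OS call):

    import threading, time

    def blocker(name):
        time.sleep(2)                     # stand-in for a blocking OS call
        print(name, "finished")

    start = time.time()
    threads = [threading.Thread(target=blocker, args=("thread-%d" % i,))
               for i in range(3)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("elapsed: %.1fs" % (time.time() - start))   # ~2s: the blocked threads
                                                      # didn't stop each other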
Note the GP said multiprocessing (that's a Python library for multiple processes with a thread-like interface, not green threads). As for why some see it as not enough: it's a bit like CGI vs FCGI; spinning up a whole new process and VM is not a cheap operation compared to starting a thread. Sure, you can work around it with design changes, pools and whatnot, but some would rather see the problem fixed correctly rather than worked around.
In the long run, it seems like approaches like ConCRT and User Mode Scheduling (Windows only, sadly) could address this. The OS can provide support for detecting any blocking operations (even page faults, for example) and doing task-level switching for you.
Note that Python doesn't use green threads. It uses real, native threads. The thing is that the interpreter will prevent any other threads from accessing it until the GIL is released.
The problem with the GIL is nearly the opposite. The lock is dropped for system calls but kept when Python code is running. That's not a problem for IO bound workloads but does not scale well over multiple CPUs for CPU bound tasks that do most of the work in Python-land.
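It's easy to see the CPU-bound case for yourself (a sketch; numbers will vary, but on stock CPython the threaded run won't be ~2x faster):

    import threading, time

    def burn():
        total = 0
        for i in range(5000000):      # pure Python work, so the GIL stays held
            total += i * i

    start = time.time()
    burn(); burn()
    print("sequential:  %.1fs" % (time.time() - start))

    start = time.time()
    t1, t2 = threading.Thread(target=burn), threading.Thread(target=burn)
    t1.start(); t2.start()
    t1.join(); t2.join()
    print("two threads: %.1fs" % (time.time() - start))   # about the same, or worse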
I'm not sure dropping the GIL is the best solution. Something like Erlang-style message passing sounds like a better model. I'm not sure that's a good fit either, though, since it works best with functional-style programming.
No, I remember reading those comments before this revision.
I was under the impression that these Google Code wiki pages were kept under source control, so you should be able to view histories etc., but I can't see any obvious link.
Found it, the change to this section was done 33 hours ago, a diff can be found here:
At least they want to get rid of the horrible abomination known as reference counting. It's the primary reason why I've moved on to other languages for the sake of creating games.