
The biggest disadvantage of the shell is that, by exchanging data using text, you lose opportunities to check for errors in the output. If you call a function in a programming language and an erroneous output happens, you get a crash or exception. In a shell, you'll get empty lines or, worse, incorrect lines that will propagate to the rest of the script. This makes it impractical to write large scripts, and debugging them gets more and more complicated. The shell works well for a few lines of script; any more than that and it becomes a frustrating experience.


It's even worse than that. Most non-trivial (and some trivial) scripts and one-liners rely on naive parsing (regex, cut, etc.) because that's the only tool in the toolbox. This has resulted in some horrific problems over the years.

I take a somewhat hard line that scripts and terminals are for executing sequential commands naively only. Call it "glue". If you're writing a program, use a higher level programming language and parse things properly.

This problem of course does tend to turn up in higher level languages but at least you can pull a proper parser in off the shelf there if you need to.

Notably if I see anyone parsing CSVs with cut again I'm going to die inside. Try unpicking a problem where someone put in the name field "Smith, Bob"...
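A two-line illustration of the failure mode (assuming python3 is available for the comparison):

```shell
# cut splits on every comma, including the one inside the quoted name,
# so the "second field" is garbage:
echo '"Smith, Bob",42' | cut -d, -f2

# a real CSV parser (Python's csv module here) respects the quoting:
echo '"Smith, Bob",42' | python3 -c 'import csv,sys; print(next(csv.reader(sys.stdin)))'
```

The first command prints ` Bob"`; the second prints the correctly split row.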


> Notably if I see anyone parsing CSVs with cut again I'm going to die inside. Try unpicking a problem where someone put in the name field "Smith, Bob"...

How do you tackle this? Would you count the number of commas in each line and then manually fix the lines that contain extra fields?


You parse them and reject CSVs that do not conform. There is absolutely no way to reason about a malformed CSV.


Yes, as in "check the number of parsed fields for each line" and don't forget about empty fields. Throw an error and stop the program if the number of columns isn't consistent. Which doesn't mean that you can't parse the whole file and output all errors at once (which is the preferred way, we don't live in the 90s any more ;), just don't process the wrong result. And with usable error messages, not just "invalid line N".
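A minimal sketch of that check, for *unquoted* CSV only (the sample file and field logic here are hypothetical):

```shell
# hypothetical input: header defines the expected field count,
# line 3 is missing a field
printf 'name,age,city\nbob,25,berlin\nalice,30\n' > /tmp/people.csv

# flag every line whose field count differs from the header's,
# report all errors at once, then exit non-zero
awk -F, 'NR == 1 { want = NF; next }
         NF != want { bad = 1
                      printf "line %d: expected %d fields, got %d\n", NR, want, NF }
         END { exit bad }' /tmp/people.csv
```

It prints one message per bad line and only fails at the end, so you see every problem in a single run.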


http://www.catb.org/~esr/writings/taoup/html/ch05s02.html

paragraph titled DSV style. (Yeah esr, not a fan, whatever...)

CSV sucks no matter what; there is no single CSV spec. Then even if you assume the file is "MS Excel style CSV", you can't validate that it conforms. There's a bunch of things the libraries do to cope with at least some of it that you will not replicate with cut or an awk one-liner.


You might have enjoyed DCL under VMS.

It did not immediately succumb to envy of the Korn shell.


What if you had constraints on the CSV files? Suppose you knew that they don't contain spaces, for example. In that case, I don't see the problem with using UNIX tools.
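If that guarantee genuinely holds (no commas, quotes, or newlines inside fields), the naive tools are indeed safe, e.g.:

```shell
# safe only under the stated constraint: fields never contain
# commas, quotes, or embedded newlines
printf 'alice,30\nbob,25\n' | cut -d, -f1
```

This prints the two names, one per line. The catch, as discussed above, is keeping that constraint true forever.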


Then you're not actually processing the CSV format, you're processing a subset of it. You'll also likely bake that assumption into your system and forget about it, and then potentially violate it later.

Well-defined structured data formats, formal grammars, and parsers exist for a reason. Unix explicitly eschews that in favor of the fiction of "plain text", which is not a format for structured data by definition.


> The biggest disadvantage of the shell is that, by exchanging data using text, you lose opportunities to check for errors in the output.

That's pretty bad, but isn't the complete lack of support for structured data an even bigger one? After all, if you can't even represent your data, then throwing errors is kind of moot.


Oils/YSH has structured data and JSON! This was finished earlier this year

Garbage Collection Makes YSH Different (than POSIX shell, awk, cmake, make, ...) - https://www.oilshell.org/blog/2024/09/gc.html

You need GC for arbitrary recursive data structures, and traditionally Unix didn't have those languages.

Lisp was the first GC language, and pre-dated Unix, and then Java made GC popular, and Java was not integrated with Unix (it wanted to be its own OS)

----

So now you can do

    # create some JSON
    ysh-0.23.0$ echo '{"foo":[1,2,3]}' > x.json

    # read it into the variable x -- you will get a syntax error if it's malformed
    ysh-0.23.0$ json read (&x) < x.json


    # pretty print the resulting data structure, = comes from Lua
    ysh-0.23.0$ = x
    (Dict)  {foo: [1, 2, 3]}

    # use it in some computation
    ysh-0.23.0$ var y = x.foo[1]
    ysh-0.23.0$ = y
    (Int)   2


Structured shells are neat and I love them, but the Unix philosophy is explicitly built around plain text - the "structured" part of structured shells isn't Unixy.


It's not either-or -- I'd think of it as LAYERED

- JSON denotes a data structure, but it is also text - you can use grep and sed on it, or jq

- TSV denotes a data structure [1], but it is also text - you can use grep on it, or xsv or recutils or ...

(on the other hand, protobuf or Apache Arrow are not text, and you can't use grep on them directly. But that doesn't mean they're bad or not useful, just not interoperable in a Unix style. The way you use them with Unix is to "project" them onto text)

etc.

That is the layered philosophy of Oils, as shown in this diagram - https://www.oilshell.org/blog/2022/02/diagrams.html#bytes-fl...

IMO this is significantly different and better than say PowerShell, which is all about objects inside a VM

what I call "interior vs. exterior"

processes and files/JSON/TSV are "exterior", while cmdlets and objects inside a .NET VM are "interior"

Oils Is Exterior-First (Code, Text, and Structured Data) - https://www.oilshell.org/blog/2023/06/ysh-design.html

---

[1] Oils fixes some flaws in the common text formats with "J8 Notation", an optional and compatible upgrade. Both JSON and TSV have some "text-y" quirks, like the UTF-16 legacy and the inability to represent tabs

So J8 Notation cleans up those rough edges, and makes them more like "real" data structures with clean / composable semantics


> It's not either-or -- I'd think of it as LAYERED

You can almost always represent one format inside of another. I can put JSON in a text system, or I can hold text in Protobuf. That doesn't mean that the systems are equivalently powerful, or that a text-oriented system supports structured data.

> on the other hand, protobuf or Apache arrow not text, and you can't use grep on them directly

This is kind of tautological, because of course you can't grep through them, because Unix and grep don't understand structure, because they're built around text.

> IMO this is significantly different and better than say PowerShell, which is all about objects inside a VM

Why?


> if you can't even represent your data

any data can be represented as text


Yes you can represent anything as text, but then you create another problem: parsing the representation. You always need to balance the advantage of sending data in a free format and the need to parse it back into usable input.


Only in the least useful, most pedantic sense of the word. I can "represent" image data as base64 text in my terminal and yet have no tools to meaningfully manipulate it with. This is a gotcha with no interesting thought behind it.


Your example is particularly funny, because one can easily decode the base64 image and convert it to sixel so that it's displayed in the terminal verbatim, all in half a line of shell pipes. If the image is large, you can also pipe it to an interactive visualizer to look at it. Or you can just "cat" your base64 image as is, together with some recently echoed mail headers, and pipe it to mail as an attachment. The tools to do that are standard and come with all unix distributions.

The greatness of unix is that none of the intermediary programs need to know that they are dealing with an image. This kind of simplicity is impossible if the type of your data must be known by each intermediary program. "Structured data" is often an unnecessary and cumbersome overkill, especially for simple tasks.


> Your example is particularly funny, because one can easily undecode the base64 image

I don't see how that's relevant - the fact that a very small number of specific terminal programs can decode limited data formats and communicate over terminal control sequences that only work for that data format doesn't seem to matter to the discussion of whether the shell or OS supports structured data in general, or whether it gives you the tools to manipulate that data.

Moreover, this is actually an example that further proves my point: you can't use classic Unix tools to process that image data. You can't use cut to select rows or columns, head or tail to get parts of an image, grep to look for pixels - of course you can't, because it's not text, and one of the core tenets of the Unix philosophy is that everything is plain text.

Now, this wasn't a great example to begin with, because media is somewhat of a special case (and it's somewhat sequential data, as opposed to hierarchical). Let's take something like an array of transactions for a personal budget in JSON form, where each transaction has fields like "type" (e.g. "credit", "debit"), "amount", and "other party", and "other party" further has "type" (e.g. "business", "person", "bank") and "name" fields. The Unix philosophy does not allow you to directly represent this data - there is no way to represent key-value pairs or nested data and then do things like extract fields or filter on values. Your primitives are lines and characters, by definition.
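To make that concrete, here's a hypothetical version of that transactions example (the file name and fields are invented for illustration). Summing the debit amounts takes a structure-aware tool; here it's Python's json module, and jq would work equally well:

```shell
# hypothetical sample data: an array of transaction objects
printf '%s' '[
  {"type": "debit",  "amount": 40.0, "other_party": {"type": "business", "name": "Acme"}},
  {"type": "credit", "amount": 10.0, "other_party": {"type": "person",   "name": "Bob"}}
]' > /tmp/transactions.json

# sum the debit amounts -- this needs a tool that understands the
# nesting; cut/grep/sort operate on lines and characters and can't
# do this reliably
python3 -c '
import json
txns = json.load(open("/tmp/transactions.json"))
print(sum(t["amount"] for t in txns if t["type"] == "debit"))
'
```

Note that the shell here is only glue: all the structural work happens inside another language.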

> This kind of simplicity is impossible if the type of your data must be known by each intermediary program.

I think you're misunderstanding what "structured data" is, because what you just described isn't a property of structured data. I can write a "filter" program that takes an arbitrary test function and applies it to a blob of data, and that program needs zero information about the data, other than that it's an array.

Or, it's possible that you're just not familiar with the concept of abstraction, where programming systems can expose an interface that doesn't require other systems to understand their internals.

> The greatness of unix is that none of the intermediary programs need to know that they are dealing with an image.

This has nothing to do with Unix. It's trivial to conceive of a shell or OS where programs pass around structured data and are able to productively operate on it without needing to know the full structure of that data.

> "Structured data" is often an unnecessary and cumbersome overkill

What does this even mean? Data is structured by definition. If something doesn't have internal structure to it, it is not data. The fact that Unix decides to be willfully ignorant of that structure and force you to ignore it doesn't mean that it doesn't exist.

I would hope you didn't mean "representing structured data as such is inconvenient" because that would be a very ignorant statement to make that conflates tooling with data format and that has zero empirical evidence behind it.


> there is no way to represent key-value pairs or nested data and then do things like extract fields or filter on values

man, ever heard of awk? It was designed exactly for that purpose.


I'm very well aware of awk. This proves my point - Unix doesn't have a way to represent structure, so the only way you can manipulate it is with purpose-built tools. There's no way to tell "sort" to sort on, say, a JSON field - sort's -t/-k options only split naively on a delimiter and fall apart on anything quoted or nested. Things that are trivial in a structured/typed language like PowerShell or Python require a lot of sed/awk glue in Unix and are far less reliable as a result.
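A small hypothetical illustration: sorting a two-line CSV by its numeric second column with sort's own field options silently breaks as soon as a field contains a quoted comma:

```shell
# fields: name, amount -- the intent is to sort numerically by amount.
# sort -t, splits naively on EVERY comma, so for the first line "field 2"
# is ' Bob"' (numeric value 0), not 5, and the order comes out wrong
printf '"Smith, Bob",5\n"Jones",3\n' | sort -t, -k2 -n
```

The Smith line sorts first (its bogus key compares as 0), even though 3 < 5 says Jones should.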

Anyone who has a basic understanding of computing theory understands that shell is obviously Turing-complete and can compute anything that another Turing machine has - which means that being able to hack something together does not mean that the system was designed for that paradigm. You can implement Prolog in C, but that does not mean that C is a declarative logic programming language, and if you find yourself repeatedly re-implementing Prologs in your C programs, you'd do very well to have the self-awareness to realize that C is probably not the right choice for the problems you're trying to solve. Similarly, if you have to repeatedly shell out to a tool that understands structure like awk or Perl or jq, you'd do well to consider as to whether the shell doesn't actually fit the problem you're trying to solve.

There's a reason why there are precisely zero large pieces of software written in bash - because it's an extremely poor paradigm for computation.


reminds me of 'parse, don't validate'


At the same time, the POSIX shell can be implemented in a tiny binary (dash compiles to 80k on i386).

Shells that implement advanced objects and error handling cannot sink this low, and thus the embedded realm is not accessible to them.


Sure they can, Smalltalk and Lisp environments didn't have the luxury of 80k when they were invented.


No, they had the luxury of having much more.


Learn computing history, starting with Lisp REPLs on IBM 704 in 1958!

20 years later we had 64 KB to fit a whole OS and applications on home computers.


I am quite aware of computer history. The Xerox Alto had between 90k and 512k of memory, and a disk drive that managed 2.5M.

Lisp Machines required a 80M disk with at least 512k of memory.

Neither Smalltalk nor Lisp fit in 64k in 1980 (~20 years later). Even the IBM 7094 which ran a very tiny Lisp (1.5) had around 32k 36bit words.


The very first version of Smalltalk was written in BASIC on Data General Nova.

Then, in 1983, there was a version of Lisp for ZX Spectrum (which had 48k of RAM and no floppy).

There was also Rosetta Smalltalk:

> ROSETTA SMALLTALK now runs on the Exidy Sorcerer computer! It requires 48K of memory, a disk, and CP/M.


True, the absolutely first versions were entirely unusable. Glacial is what one of the authors said.

First version of Unix had a 512k disk pack, and no shell -- that 80k binary for shell would have been a dream.

"20 years later we had 64 KB to fit a whole OS and applications on home computers" is quite the claim when we are talking about Unix, Lisp, or Smalltalk and comparing them to CP/M or DOS, which does nothing in comparison, while ignoring literally all other aspects of a computer system.


Unless I got my math wrong, 32 is less than 80.

Also, where in my comment did I mention Lisp Machines?


"20 years later" -- nobody was running an IBM 704 in 1980 for Lisp or Smalltalk. The 7094 was backed by a hard disk as well. These machines used quite a bit of paging.

You are purposefully confusing multiple decades of computing. Smalltalk was much later, and required quite large machines, so did Lisp when it became popular on machines like the PDP-10 and ITS which had much more memory, just running Macsyma was a PITA.


We were talking about Lisp on CP/M, ever heard of it?

Also, now we're bringing MMUs and paging into the argument?


And CP/M wasn’t running the first versions of Lisp or Smalltalk.

Your claim was that Lisp and Smalltalk didn't have the luxury of 80k when they were invented, when in fact they did. Much of the software that ran using Lisp specifically was VERY memory hungry.

Now you're just arguing for the sake of arguing and being antagonistic. Have fun.


Could Macsyma be run on CP/M on an Altair?


There were other Z80 computers able to run Lisp and CP/M.

Here he goes again with another application I didn't mention at all.


Macsyma on the Simh emulator, Ka10, appears to do it fine under ITS.


You're not sharing the system with a dozen other people running Emacs and Macsyma :-)


Timesharing wasn't part of the discussion.


that's dramatically larger than any pdp-11 executable, including the original bourne shell, and also, for example, xlisp, which was an object-oriented lisp for cp/m

advanced objects and error handling do not require tens of kilobytes of machine code. a lot of why the bourne shell is so error-prone is just design errors, many of them corrected in es and rc


What are es and rc? Can you give some links?


Search for "rc shell" and "es shell"

https://en.wikipedia.org/wiki/Rc_(Unix_shell)

https://wryun.github.io/es-shell/

They are alternative shells, both from the 90's I believe. POSIX was good in some ways, but bad in that it froze a defective shell design

It has been acknowledged as defective for >30 years

https://www.oilshell.org/blog/2019/01/18.html#slogans-to-exp...

---

es shell is heavily influenced by Lisp. And actually I just wrote a comment that said my project YSH has garbage collection, but the es shell paper has a nice section on garbage collection (which is required for Lisp-y data structures)

And I took some influence from it

Trivia: one of the authors of es shell, Paul Haahr, went on to be a key engineer in the creation of Google


Thanks, I didn't know that about es!


David Korn would not agree with you.

"A lot of effort was made to keep ksh88 small. In fact the size you report on Solaris is without stripping the symbol table. The size that I am getting for ksh88i on Solaris is 160K and the size on NetBSD on intel is 135K.

"ksh88 was able to compile on machines that only allowed 64K text. There were many compromises to this approach. I gave up on size minimization with ksh93."

https://m.slashdot.org/story/16351


it sounds like he did agree, even laboring under the heavy burden of bourne shell compatibility


The && and || operators let you branch on errors.
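For example:

```shell
# the right-hand side of && runs only when the previous command
# succeeded (exit 0); the right-hand side of || runs only when it
# failed -- a crude if/else on the exit status
true  && echo "ran because true succeeded"
false || echo "ran because false failed"
```

Both echo commands fire, each triggered by the opposite exit status.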


Not like a Lisp REPL, you have to start over with pipes. That's my main issue with Unix.


that's why rule #1 of robust shell scripts is "set -e": exit on any error. This is not perfect, but helps with most of the errors.


set -euo pipefail, if you're OK with the Bashism.


Since POSIX 2024, `set -o pipefail` is no longer a bashism!
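A quick illustration of what pipefail changes (run under bash, or any shell that supports the option):

```shell
# without pipefail, a pipeline's exit status is its LAST command's
# status, so the mid-pipe failure is silently swallowed:
false | true; echo "without pipefail: $?"

set -o pipefail

# now the failing command anywhere in the pipe surfaces as the status:
false | true; echo "with pipefail: $?"
```

The first echo reports 0, the second reports 1.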



