Correlation between the use of swearwords and code quality in open source code? [pdf] (h-its.org)
101 points by cpeterso on Feb 12, 2023 | 59 comments


My subjective personal experience:

* good developers seem to be more pissed off by poor code. Seems easy to explain -- people tend to do better when they care about what they are doing.

* even better ones seem to be more accepting of people with varying ability to write code and spend more effort trying to help them make the best of their abilities.

* poor developers tend to try to make up for their perceived inadequacies by trying to look more professional in other ways (like behaving "professionally", dressing better, using marketingspeak on meetings, etc.)

* on the other hand, a large portion of developers have inflated egos and think they are super good while really writing crappy code and not even understanding why it is crappy. I worked with two guys at two separate companies who were both lead developers and said they were the best developers in the world. Both were super confident, and this helped them advance. One of them was fired when the entire library they brought in as an "asset" failed spectacularly because it parsed XML messages by extracting "n characters from m line at n offset" rather than doing actual XML parsing.
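To make the XML point concrete, here is a minimal sketch of the difference, assuming a made-up message format (the element names and offsets are hypothetical, not taken from the library in question). It contrasts slicing characters out of a fixed line with letting a parser locate the element.

```python
import xml.etree.ElementTree as ET

# Hypothetical message; the element names are invented for illustration.
MESSAGE = """<order>
  <id>42</id>
  <customer>Ada</customer>
</order>"""

# The same message, just written on a single line.
COMPACT = "<order><id>42</id><customer>Ada</customer></order>"


def customer_by_offset(text):
    # Brittle: assumes the value is on the third line, starting at
    # column 12 and running for 3 characters.
    line = text.splitlines()[2]
    return line[12:15]


def customer_by_parsing(text):
    # Robust: a real XML parser finds the element wherever it sits.
    root = ET.fromstring(text)
    return root.findtext("customer")


print(customer_by_offset(MESSAGE))   # works only for this exact layout
print(customer_by_parsing(MESSAGE))
print(customer_by_parsing(COMPACT))  # layout changed, parsing still works
```

Calling `customer_by_offset(COMPACT)` raises an `IndexError`, because the one-line message has no third line to slice. A longer customer name would fail more quietly, by being truncated.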


> poor developers tend to try to make up for their perceived inadequacies by trying to look more professional in other ways (like behaving "professionally", dressing better, using marketingspeak on meetings, etc.)

I would actually look more into their "professional" behavior purely from a code viewpoint. As an example, I once had a coding interview submission that I was looking through, and at first sight it seemed everything was there: tests, code comments, framework patterns. My first thought was "Oh nice, they really are thorough". But looking deeper, these things hid the actual problems. E.g. the function interfaces were poor and mixed in side effects when not necessary, and the tests tested _something_, but not what was important. And the algorithmic challenge wasn't even really attempted. This made me acutely aware of how easily this person could have passed a less technical interviewer, as everything looked professional. The effects are worse than with someone who's obviously a junior or poor coder, because their code looks off at first sight.


> poor developers tend to try to make up for their perceived inadequacies by trying to look more professional in other ways (like behaving "professionally", dressing better, using marketingspeak on meetings, etc.)

Don't forget:

- trying to solve things in a "clever" non idiomatic way instead of being clear and idiomatic

- gravitating towards non coding tasks


I remember there was a study (can't find it quickly) that said people who swear are more honest and lie less. If that is true then it's not surprising for such qualities to somehow affect other things.

From personal anecdotes I would add that swearing seems to be associated with action and getting things done. People who work with their hands and tools: plumbers, car mechanics, etc, swear a lot. But even in academia I've met quite a few very intelligent people who swear in private conversations and one common characteristic among them is they hate when things are not moving.


I met 3 coders who swore more or less continuously while coding. But that's 3 out of probably close to 100 I have sat close enough to over 30 years. I do it myself, but probably at less than one hairy problem a week.

At least one of the 3 does not produce better code than average. One produces clearly more code than average. Not really bad code, but we are still fixing his bugs quite a while after he left the company. The 3rd one I don't remember, just shared an office, but did not work with his code.

Edit: To be clear, I was talking about swearing aloud. Swear words in the code itself I have not seen in 30 years of experience, except for some rare examples in open source which then get a lot of publicity.


Swearing while coding: “Who the f#ck wrote this code, looks like cr#p, what the… oh, I wrote that five years ago… why the f#ck did I write it like that, that makes no sense, what the… oh, that’s why! …”


The only swear words in our code are essentially "workaround for shit we can't change/fix" or "strong warning to not change this behaviour or something will break". Like

    # as of 5.x libvirt shits on intermediate certs included in certfile and need direct path
or

    # once a month restart, this shit leaks


My favorite are the function documentation in some of Apple's headers.

Apple's CoreFoundation always irked me because it didn't do NULL parameter testing. Try to CFRelease(aNullObject)? You crashed, son.

I asked one of the CF engineers about this and I was told it was a performance thing. (I can kind of see that ... caller tests that the CFCreateFoo() did not return NULL and then can make any number of subsequent calls without having to re-sanity-check that value.)

Regardless, look out for Apple headers with documentation comments similar to:

    If properties parameter is NULL returns false, does not crash like CoreFoundation.
Not cursing, but what can you do in public headers...


Measuring code quality is always fraught with issues. That said, I could imagine that code bases that allow for swearing generally have a more honest and open communication culture, so swearing may be an indirect indicator of this. Which may be a better explanation than the hypothesis that "the use of swearwords constitutes an indicator of a profound emotional involvement of the programmer with the code and its inherent complexities, thus yielding better code based on a thorough, critical, and dialectic code analysis process."


Well, that sounds plausible, but it seems like we could come up with all sorts of stories to say basically anything here. Maybe swear words keep other contributors away, resulting in smaller dev teams that can have more consistent standards.


Well, your hypothesis is easy to check: one needs to repeat the study controlling for dev team size. It would be harder to measure a culture's openness and honesty.


Yes, projects where the contributors care about their projects. Corporations rarely allow swear words in their code bases.


You are under the impression that corporations care about code in any way. They don't. They only care about business outcomes. If you mention "code quality" they will just stare at you. It is better to talk about "maintainability" or "development velocity" etc. It is teams/team managers who care about code quality.

It is the people who work at corporations who self-censor to keep up a professional appearance.

I have never had any corporation disallow me from swearing in the codebase.


No, the opposite. People who care use swear words. Corporations neither care nor allow swear words.


Abstract:

> One of the most fundamental unanswered questions that has been bothering mankind during the Anthropocene is whether the use of swearwords in open source code is positively or negatively correlated with source code quality. To investigate this profound matter we crawled and analysed over 3800 C open source code containing English swearwords and over 7600 C open source code not containing swearwords from GitHub. Subsequently, we quantified the adherence of these two distinct sets of source code to coding standards, which we deploy as a proxy for source code quality via the SoftWipe tool developed in our group. We find that open source code containing swearwords exhibit significantly better code quality than those not containing swearwords under several statistical tests. We hypothesise that the use of swearwords constitutes an indicator of a profound emotional involvement of the programmer with the code and its inherent complexities, thus yielding better code based on a thorough, critical, and dialectic code analysis process.


The opening sentence is really the best part of this thesis.


Outside of coding, there have been correlation studies between swearing and higher intelligence, in which those who swear were generally found to be more intelligent and more fluent in languages.

Could it be that swearing coders are more intelligent? If coders are more fluent in the use of languages, are they also more fluent in code?

Articles on swearing and intelligence: https://www.sciencedirect.com/science/article/abs/pii/S03880... https://medicalxpress.com/news/2017-02-higher-language-relat...


I wonder how this applies to foreign languages. I swear quite a lot (not something I'm particularly proud of, or that I think is an indication of any other positive/negative quality I may or may not have). But I pretty much only swear in my native language, which is not English. I simply don't feel confident enough in English to use swearing appropriately, if there's such a thing.

However, coding is done in English. Period. And code comments are written in English as well. So you won't find a lot of profanity in my code comments or commit messages - even though there might have been some if English had been my native language, or the general practice was to write code, comments and commit messages in the local language.

I'd argue that swearing in particular requires a fairly deep understanding of the nuances of a language that general use does not - misplaced swearing makes you look a lot more like a fool than many other mistakes.


Funny thing, I swear more in languages other than my native tongue.

I guess I don't consider swearing in foreign languages to be real, similarly to how you do things in a video game which you wouldn't do in real life.


Hah! That's hilarious. I have the same feeling with foreign currencies. Prices are just a random number, and you happily hand over those funnily colored bills without thinking too much about it.

Yes, trips abroad tend to be unreasonably expensive for reasons I have yet to figure out :-)


Exactly right! I know perfectly well that the GBP to PLN ratio is 1 : 5.5, but whenever I'm in the UK, spending £100 on something feels like spending 100 PLN, not like 550 PLN.

Now this wouldn't be a problem if the actual purchasing power[0] followed the inverse of the exchange ratio (e.g. a beer costing 5.50 PLN in Poland would cost £1 in UK), but in fact things are some 2-5x more pricey in the UK (e.g. that 5.50 PLN beer is £4 in the UK), so my trips to London also end up unreasonably expensive for uncertain reasons :).

--

[0] - Not sure what the correct term is for the ratio between the typical price of a common thing in one country and the price of the same thing in another country, denominated in the first country's currency. E.g. typical price of bread in Poland, given in PLN, divided by typical price of bread in UK, given in GBP converted to PLN using current exchange ratio.
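For what it's worth, the arithmetic in the beer example can be written out as a small sketch. The numbers are the ones quoted above; the variable and function names are made up for illustration.

```python
# Figures from the comment above; all names here are hypothetical.
PLN_PER_GBP = 5.5          # exchange rate: 1 GBP buys 5.5 PLN
beer_in_poland_pln = 5.50  # typical beer price in Poland
beer_in_uk_gbp = 4.0       # typical beer price in the UK


def uk_price_in_pln(price_gbp, rate=PLN_PER_GBP):
    """Convert a UK price to PLN at the current exchange rate."""
    return price_gbp * rate


beer_in_uk_pln = uk_price_in_pln(beer_in_uk_gbp)

# The ratio described in the footnote: Polish price over converted UK price.
footnote_ratio = beer_in_poland_pln / beer_in_uk_pln

# The same thing read the other way round: how many times pricier the UK beer is.
markup = beer_in_uk_pln / beer_in_poland_pln

print(beer_in_uk_pln, footnote_ratio, markup)  # 22.0 0.25 4.0
```

So the £4 beer is really a 22 PLN beer, four times the Polish price, which sits inside the "2-5x" range mentioned above even though it "feels" like 4 PLN.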


So to take this further, is there a swearing exchange rate? Eg a German swear word is worth two English ones?

I’m told (by a Russian) that Russian swearing is very intense, so that may be a concise way to go.


This is a very real phenomenon. When a large part of your exposure to English comes from movies, video games and media personalities, it can create a warped perception of what is acceptable and what isn't. Swearing in my native language immediately gives me disgust, while English swear words do not produce any emotional response.


> However, coding is done in English. Period. And code comments are written in English as well.

This was not the case at my first tech job. The office communication, the reference books, the comments and sometimes even the commit messages were often not in English. The same is true of many large projects on GitHub, even today.


If that’s true, Australia is full of linguist polymaths.


Am I misreading, or do they really equate swear vocabulary size with swearing frequency without providing data for this claim? Of course a large vocabulary in one general area tends to correlate with a large vocabulary in other areas, especially among young university students who haven't spent a lifetime dedicating themselves to one field.


> Outside of coding, there have been correlation studies between swearing and higher intelligence, in which those who swear were generally found to be more intelligent and more fluent in languages.

I guess they didn't visit the casino yet.


This is... not at all a useful analysis, to put it lightly.

The word "bug" or "defect" appears ZERO times in this study.

Any study that completely ignores defects in measuring "code quality" is at best misguided, and at worst could drive change that actually results in increased defect density.

High quality code can simply be defined as code that has few bugs. Unfortunately that's a little harder to measure, but they could've at least analysed the issues for a start.

That said, I don't think there's going to be any correlation between the presence of swearing and the defect density.


The less code you write the fewer bugs, so LOC should be a good indicator of code quality :P

But bugs alone say nothing about code quality: the more you look, the more bugs you will find. More popular software will have more bugs (found).

Style guides are not the same as quality. Some people argue about line length, white space, and where to put line breaks, but that has nothing to do with the quality of the code/software, but it's easy to measure.

Do you understand what the code does? How easy is it to make a change? Comprehension and maintainability probably result in better code/software quality, but are difficult to measure.

Maybe if we measure the average bug count per user, as well as let users assess/rate the software - does the code/software do the job it was written for?


> High quality code can simply be defined as code that has few bugs. Unfortunately that's a little harder to measure, but they could've at least analysed the issues for a start.

It is nearly impossible to measure directly. One needs to find all the bugs to measure defect density directly. There are indirect ways, one could use a proxy variable, and they did.

They used static analysis as a proxy variable. This choice has some traps to avoid, but bugs from bug trackers or CVEs are highly dependent on the development practices and popularity of the code, and they can (I'd say must) correlate with swearing by themselves. To deal with that one would need to formulate a causal hypothesis for real and to test it for real. Such a study is not for a bachelor thesis, it would be master's at least.

Though I'd like to see the authors refine their techniques enough to dig out a causal explanation.


The tool they use for evaluation (softwipe) is discussed in another paper, but seems to basically count/combine the results of a set of different static and dynamic analysis tools (ASan, UBSan, cppcheck etc)

Presumably those results should correlate somewhat with bugs and defects.


Note this is a correlation analysis. Before drawing any causal link between the result and the developers' identity, keep in mind that the data itself may include bias.

Here is an example of a possible bias that is totally unrelated to the developers' identity. In Figure 3.9 and Figure 3.10, the SoftWipe scores over LoC show a very different pattern from the star-repos, which could be explained as follows:

If each developer's use of swearwords is statistically independent, then a project with more contributors is more likely to include swearwords. More contributors may correlate with having a coding style guide, which has a causal effect on the score from the SoftWipe tool (https://www.nature.com/articles/s41598-021-89495-8).

Be careful when trying to explain data :)
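A quick sketch of the contributor-count argument, assuming every contributor independently leaves a swearword with the same probability p. The value 0.05 is purely illustrative, not a figure from the thesis. Under that assumption, the chance that a repository contains at least one swearword is 1 - (1 - p)^n for n contributors.

```python
def chance_repo_contains_swearword(p_per_contributor, contributors):
    """Probability that at least one of `contributors` independent
    contributors has left a swearword in the code."""
    return 1.0 - (1.0 - p_per_contributor) ** contributors


# p = 0.05 is a made-up value used only to show the shape of the effect.
for n in (1, 10, 50):
    print(n, round(chance_repo_contains_swearword(0.05, n), 3))
```

With these made-up numbers the probability climbs from 5% for a single contributor to roughly 40% for ten, so team size alone could push larger projects into the "swearing" group.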


Came here to say this.

I wouldn’t be surprised if swearing codebases are better in open source, but I would be surprised if swearing in your codebase makes it better. (This paper only works to prove the former not the latter.)

The obvious traditional opinion is that swearing indicates a lack of intelligence and therefore worse code. I’m just saying that this is on the front page because it’s spicy (the naively-inferred causation goes against the traditional intuition).


I've worked with great developers that swore a lot, and great ones that didn't.

Main difference I observed was that the non-swearing ones were seen as less intimidating and more approachable. They were often more helpful to junior members of the team in terms of providing advice or guidance.

The swearier ones were left alone more, and the swearing was seen a bit like wearing headphones (i.e. I'm focusing and busy, don't you dare interrupt me).


Using swear words often, in public settings or without any real distress situation is in my opinion a mark of the uneducated / "bad" upbringing.

Sure, you let some steam out. I understand that but why put in writing?


>Sure, you let some steam out. I understand that but why put in writing?

Let off steam / save mental effort regulating what you say / indicate to the person you are speaking to that you are comfortable being 'unprofessional'

I also know someone who intentionally swears a lot in interviews as a filtering mechanism, that's probably less common though.


It is like yelling in public: sometimes you need to do it to get your point across, but if you do it constantly you are just weird.


Even though it's a bachelor thesis, implying fairly lax academic standards, it still looks like completely bogus research based on largely random "quality" criteria.


First thing that came to my mind was what is considered quality code. Here is an excerpt on how they filtered high quality git repos from Github based on number of stars:

  > However, one can assume that repositories which are of interest to people and are more widely used at least exhibit a decent level of code quality. The 4 star boundary was chosen arbitrarily. The rationale behind that was only based on the assumption that repositories are most likely pareto distributed according to stars and quality, meaning that even excluding repositories with only a small amount of stars will exclude most GitHub repositories and yield more high-quality repositories.
What do you think? Are these assumptions correct?

Edit: added a missing word


I think the assumption that makes more sense is that amount of stars directly correlates to usage of emoji and variations of “blazingly fast” in the readme.


https://words.filippo.io/dispatches/whoami-updated/ found a way to enumerate all github users. Shouldn't be too hard to enumerate all repos from there, count stars, and check the pareto distribution assumption?


> It is very important to note that small p-values do not guarantee that the results are replicable or that statistical significance implies practical significance [18]. This means that swearing will not automatically improve the quality of your code. However, a study showed that swearing in the workplace acts as a form of stress relief [1], which in turn could then improve focus and therefore code quality. This might be a possible explanation for the findings.

Swearing is a means of stress relief, leading to better code. There are no control groups, but that is an interesting conclusion.


"...These tests, combined with our visual analysis of the data yielded the result that repositories containing swearwords exhibit a statistically significant higher average code-quality (5.87) compared to our general population (5.41)..."

The swearing is a proxy for caring? Not that non swearing don't care but the ones that swear are putting their soul and bones into the caring?

Other proxies the researchers could have used:

- Hours extra in the office and code quality.

- Professional industry certifications and code quality.

- Attendance of Software conferences and code quality.

- Consumption of coffee versus tea and code quality...


Takes a lot of work to advise theses, and Bachelor's theses are at bottom of list, but this student would have been served well by being encouraged to move all these page-long sections explaining terms and statistical tests to an appendix, or just replaced with a brief description and citation.

Nevertheless, still a pretty good thesis for a Bachelor's student. I taught statistics to both undergrads and grads in social science and had very few students who showed this level of curiosity into methods and then did the work to apply the methods.


The Yandex data breach the other day revealed swear words and racist language in their source code. I'm not sure what that says about their code quality.


The conclusion is clear: add more swearing to your comments, or preferably: your identifiers. That'll improve the code quality greatly. Your manager will count them during the next performance review.

More seriously: I get that people upvote this for fun, but all these studies are so bad. What's the fracking prior, may I ask, especially the one that relates the article's measure of swearing to the measure of code quality?


“This comparison was done by running multiple hypothesis tests, such as the Kolmogorov-Smirnov test. These tests, combined with our visual analysis of the data yielded the result that repositories containing swearwords exhibit a statistically significant higher average code-quality (5.87) compared to our general population (5.41).”
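For anyone unfamiliar with the Kolmogorov-Smirnov test mentioned in that quote: the two-sample version compares whole distributions rather than just averages, using the largest gap between the two empirical cumulative distribution functions as its statistic. Below is a minimal sketch of that statistic only (no p-value), with made-up scores. It is not the thesis's actual analysis code.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical cumulative distribution functions."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    # Both step functions only change at observed values, so it is
    # enough to compare them at every observed value.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # share of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of b that is <= x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical quality scores, invented purely for illustration.
swearing_repos = [6.1, 5.9, 6.4, 5.7]
general_repos = [5.2, 5.6, 5.3, 5.8]
print(ks_statistic(swearing_repos, general_repos))  # 0.75
```

A statistic of 0 means the two samples have identical empirical distributions, and 1 means they do not overlap at all. Turning the statistic into a significance level additionally depends on the sample sizes.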


Both are rooted in confidence. If you know your shit, and you know it well, you're going to write good code. And if you know your shit, you'll find a lot of garbage code written by retarded pretenders who'd better have chosen another profession where they, at least, wouldn't stand on your way.


> One of the most fundamental unanswered questions that has been bothering mankind during the Anthropocene is whether the use of swearwords in open source code is positively or negatively correlated with source code quality. To investigate this profound matter we...

Rofl


The two most relevant plots are on pages 41 & 42 of the PDF (labelled pgs 29 & 30).

The Y-axis is code quality, X LoC. First graph is swear-word code (clumping at top), second is non-swear (bell curve).


Our study only considers open source code written in the C programming language and found on GitHub. While it is technically feasible to extend this to C++ using the same crawling and data analysis scripts, we decided to disregard C++ code for the sake of simplicity and due to time constraints. However, we are agnostic as to whether C++ programmers have the same mindset as C programmers or are politically more correct. Therefore, we are cautious with respect to generalising our findings to the object-oriented programming community.

C is an understandable (if very dated) university-level target... but I can't help but be very curious if that's being used as plausible-deniability cover for the fact that C is typically used in kernels, drivers, and other low-level work that lends itself to a type of mindset that is tactile, moves problems out of the way with a scythe, and routinely swears up a storm in the normal process of getting things done.

C++ is a very different beast, as are all the other languages. Rust I would definitely like to see. Scripting languages? Hoo boy do I want an analysis of PHP please. Ruby would be extremely interesting. Erlang provokes morbid curiosity. Python would probably win (ahem, outrank everything else) by simple virtue of being the most popular.

So this is arguably only a fair comparison if v2.0 looks at lots of different languages. Encore, please?

NB: the output charts on page 41 and 42 are interesting. The non-swear quality graph looks pretty random to me, but there are a couple of fascinating little hotspots on the swear graph around 11k-30k LOC where quality ranks at 6-8.5 and almost looks like a wide diagonal line tracking quality downward through 15k-30k LOC. Might this be an outlying companion to the Ballmer Peak (https://xkcd.com/323/)? Incidentally, these graphs render nicely side by side using Chrome's 2-page view (menu, top-right).


FOO BAR is a derivative of FUBAR, F*cked Up Beyond All Recognition.


IgNobel caliber research.

The only programmer I knew who peppered his code with obscenities was so notorious for bugs our QA engineers used to joke he was a one-man job-security program for them.


consider me fucking amazed


>use of swearwords and code quality

I guess it depends on the words.

By the time you damn it all to hell it usually doesn't have far to go.


this paper only has one author, so i assume the use of "we" throughout is the "royal we"? i thought papers were meant to use constructs such as "github was searched", rather than "we searched github", but perhaps that's just we getting old.


Passive voice is considered bad and gets annoying to read ("GitHub was searched"). "We" is common regardless of the number of authors, though theses are a bit odd in that the work is often collaborative but the document is solo authored.

Just say what you did. "We searched GitHub."


These conventions are field-specific. In mathematics and computer science, the almost universal convention is to use the first person plural.


TL;DR:

> swearing will not automatically improve the quality of your code.



