Torvalds' quote about good programmers
Accidentally I've stumbled upon the following quote by Linus Torvalds:
"Bad programmers worry about the code. Good programmers worry about data structures and their relationships."
I've thought about it for the last few days and I'm still confused (which is probably not a good sign), hence I wanted to discuss the following:
- What interpretation of this is possible/makes sense?
- What can be applied/learned from it?
I think this question probably has multiple answers that are equally valid. But it's a good question anyway. I love that quote. It expresses why I don't understand programmers who worry about switching languages. It's rarely the language that matters in a program, it's the data structures and how they relate.
Maybe if you take the time making the data structures "elegant" then the code doesn't have to be convoluted to deal with these data structures? I'm probably too dumb to really know the meaning of Torvalds' quote. :}
@JasonHolland That's pretty much it. Once you understand the data structures, the code is almost irrelevant. It becomes a matter of memory and/or reference. The complicated and interesting part is conceptually figuring everything out. I often solve problems and design solutions away from the keyboard.
@RyanKinal But of course the language *does matter*, because it makes it considerably easier to deal with and think about certain data structures. Think about all the languages that specialize in LISt Processing, for example, or languages that have native support for data structures that have to be hacked into other languages (sets and sparse arrays come to mind).
Torvalds is not alone in this, by the way: "Show me your flowchart and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won't usually need your flowchart; it'll be obvious." – Fred Brooks, The Mythical Man-Month. "Show me your code and conceal your data structures, and I shall continue to be mystified. Show me your data structures, and I won't usually need your code; it'll be obvious." and "Smart data structures and dumb code works a lot better than the other way around." – Eric S. Raymond, The Cathedral and The Bazaar.
@kojiro Language will, of course, matter for implementation, but there are very few times when you can't express your solution in whichever language you want. Some might be more difficult, or you may have to modify your solution slightly, but it usually doesn't matter much at all.
Very profound quote, and true in many dimensions. Is it smart to write the CSS before the HTML?
Does anyone have an example of a problem solved in each of the two contrasted styles, maybe a kata, which would make the idea concrete?
I have a friend who used to use another quote that I like even more: "Most programmers think about things in terms of how they work. Great programmers think about them in terms of how they break."
I would also add Dijkstra's remark that "...our intellectual powers are rather geared to master static relations and that our powers to visualize processes evolving in time are relatively poorly developed."
It might help to consider what Torvalds said right before that:
git actually has a simple design, with stable and reasonably well-documented data structures. In fact, I'm a huge proponent of designing your code around the data, rather than the other way around, and I think it's one of the reasons git has been fairly successful […] I will, in fact, claim that the difference between a bad programmer and a good one is whether he considers his code or his data structures more important.
What he is saying is that good data structures make the code very easy to design and maintain, whereas the best code can't make up for poor data structures.
If you're wondering about the git example, a lot of version control systems change their data format relatively regularly in order to support new features. When you upgrade to get the new feature, you often have to run some sort of tool to convert the database as well.
For example, when DVCS first became popular, a lot of people couldn't figure out what about the distributed model made merges so much cleaner than centralized version control. The answer is absolutely nothing, except distributed data structures had to be much better in order to have a hope of working at all. I believe centralized merge algorithms have since caught up, but it took quite a long time because their old data structures limited the kinds of algorithms they could use, and the new data structures broke a lot of existing code.
In contrast, despite an explosion of features in git, its underlying data structures have barely changed at all. Worry about the data structures first, and your code will naturally be cleaner.
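To make "git's data structures" a little more concrete, here is a minimal sketch of how a blob ID is derived in a SHA-1 repository: the hash covers a short `blob <size>\0` header followed by the file contents. The function name `git_blob_id` is made up for illustration and is not part of git; the sketch covers only loose blob IDs, not trees, commits, or packfiles.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with this content.

    Hypothetical helper for illustration only.
    """
    # The object header is the type, a space, the size in bytes, and a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    # The ID is the hash of header + content, so identical content always
    # gets the same ID, no matter which repository or file name it lives in.
    return hashlib.sha1(header + content).hexdigest()


print(git_blob_id(b"hello world\n"))
```

The result should match what `git hash-object` reports for a file with the same bytes. Because the ID depends only on the content, the code built on top (deduplication, comparing trees, fetching missing objects) can stay fairly simple.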
*the best code can't make up for poor data structures* good gravy is that true
I shouldn't need to care or know about what data structures Git uses underneath really. How is he measuring success here; by how many people use Git and find it easy to use, or by how many contribute code to it?
He's talking from the point of view of programmers making changes to git itself. The end user point of view is completely orthogonal to this discussion, other than easily maintainable code making for fewer bugs and faster feature additions.
@James: He's saying that the software is better (hence easier to use, and used by more people) because the data structures are better. Of course you don't need to *know* about the data structures of software you use, but you do *care* about them, indirectly, even if you don't realize it, because the data structures are what drive the things that you do realize you care about.
+1. This answer puts context on a statement that could otherwise be construed to mean something very different. Anyone who has read a 5000 line monstrosity of a file knows exactly what I mean.
"Worry about the data structures first, and your code will naturally be cleaner.": The Roman statesman Cato (http://en.wikipedia.org/wiki/Cato_the_Elder) used to say "Rem tene, verba sequentur" = "Have the argument clear in your mind, the words will follow naturally". Same thing with programming: understand the data structures and design first, the actual code will follow by itself.
If I am not wrong, the first versions of git were doing the sha of the content, while the newer ones (+2 years probably) do it with the content and headers. This is a data structure change that broke the first versions of git.
Which makes this kind of a single-developer version of Fred Brooks: "Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious."
I'd also like to add that bad data structures usually cause bad code, because over time, all that remains is the question "why was it designed this way?" and tacked on top of it are several half hearted attempts to fix the design. These attempts are also usually aborted out of fear because the "fix" inevitably comes down to "breaking" the way it currently works.
I have to say though, that the line between data structure and code isn't all that clear. To code up a good data structure, you often must have good code in the first place! After all, a data structure is really just more code (albeit more modular, and perhaps reusable).
@ruakh: A comment like Torvalds' *could* have referred only to the internal maintenance of git, but in this case, Linus actually sees the data structures as a kind of public interface. That goes against the religion of "data hiding"; but behind it lies a deeper truth: you should expose the simplest, most stable interfaces possible. Linus claims that for git, that is the on-disk representation, rather than any function-call API. (The git commands themselves are an API for scripts).
@AdrianRatnapala: Data structures as the public interface, plus data hiding (physical data independence), roughly equals the relational model of data. Its principles apply equally to git design and to database design.
Most media lockers (e.g. iTunes) exploit this fact to create artificial lock-in. They allow data in, but modify it (e.g. change file names) so it can't be easily extracted. In addition, most meta-data is stored in a platform-specific database format so it's only portable to the same/similar platform.
Code is just the way to express the algorithms and the data structures.
It's not fundamentally *any* different. You have data and do set of operations on it. Member variables and methods. Exactly the same thing. The whole essence of computing ever since the 50's has been built upon that very simple rule that programs consist of algorithms modifying data structures, and it keeps holding true 60 years later. You could also consider programs as *functions*. They take *input* on which they operate to produce *output*. Exactly as mathematical functions do.
This quote is very similar to one of the rules in "The Art of Unix Programming"; Unix programming is Torvalds' forte, as he is the creator of Linux. The book is available online.
From the book is the following quote that expounds on what Torvalds is saying.
Rule of Representation: Fold knowledge into data so program logic can be stupid and robust.
Even the simplest procedural logic is hard for humans to verify, but quite complex data structures are fairly easy to model and reason about. To see this, compare the expressiveness and explanatory power of a diagram of (say) a fifty-node pointer tree with a flowchart of a fifty-line program. Or, compare an array initializer expressing a conversion table with an equivalent switch statement. The difference in transparency and clarity is dramatic. See Rob Pike's Rule 5.
Data is more tractable than program logic. It follows that where you see a choice between complexity in data structures and complexity in code, choose the former. More: in evolving a design, you should actively seek ways to shift complexity from code to data.
The Unix community did not originate this insight, but a lot of Unix code displays its influence. The C language's facility at manipulating pointers, in particular, has encouraged the use of dynamically-modified reference structures at all levels of coding from the kernel upward. Simple pointer chases in such structures frequently do duties that implementations in other languages would instead have to embody in more elaborate procedures.
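The "conversion table versus switch statement" comparison from the quoted passage can be sketched roughly as follows. This is only an illustration, not something from the book: the unit names and the two function names are hypothetical, and Python's `if`/`elif` chain stands in for a C `switch`.

```python
# Logic-heavy version: the knowledge (which unit means what) lives in branches.
def to_bytes_branching(value: int, unit: str) -> int:
    if unit == "B":
        return value
    elif unit == "KiB":
        return value * 1024
    elif unit == "MiB":
        return value * 1024 ** 2
    elif unit == "GiB":
        return value * 1024 ** 3
    raise ValueError(f"unknown unit: {unit}")


# Data-heavy version: the same knowledge folded into a table,
# leaving the logic as a single lookup and multiplication.
BYTES_PER_UNIT = {
    "B": 1,
    "KiB": 1024,
    "MiB": 1024 ** 2,
    "GiB": 1024 ** 3,
}


def to_bytes_table(value: int, unit: str) -> int:
    try:
        return value * BYTES_PER_UNIT[unit]
    except KeyError:
        raise ValueError(f"unknown unit: {unit}") from None


print(to_bytes_table(2, "MiB"))
```

Adding a unit to the second version means adding one line of data; the code that reads the table does not change, and the table can be listed, validated, or reused for the reverse conversion.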
Code is easy, it's the logic behind the code that is complex.
If you are worrying about code, that means you don't yet get the basics and are likely lost on the complex parts (i.e. data structures and their relationships).
Heh, I wonder if the next generation of programmers will be asking: "Morons once said `Code is easy, it's the logic behind the code that is complex`, what did he mean?"
To expand on Morons' answer a bit, the idea is that understanding the particulars of the code (syntax, and to a lesser extent, structure/layout) is easy enough that we build tools that can do it. Compilers can understand all that needs to be known about code in order to turn it into a functioning program/library. But a compiler can't actually solve the problems that programmers do.
You could take the argument one step further and say "but we do have programs that generate code", but the code it generates is based on some sort of input that is almost always hand-constructed.
So, whatever route you take to get to code: be it via some sort of configuration or other input that then produces code via a tool or if you're writing it from scratch, it's not the code that matters. It's the critical thinking of all the pieces that are required to get to that code which matter. In Linus' world that's largely data structures and relationships, though in other domains it may be other pieces. But in this context, Linus is just saying "I don't care if you can write code, I care that you can understand the things that will solve the problems I'm dealing with".
Every programmer uses programs that generate code. They are often called "compilers", sometimes in combination with "linkers". They take a (relatively) human-readable and human-writeable input, which is usually (but not always) provided in some sort of text format, and turn it into data that the computer can understand as instructions and execute.
Linus means this:
Show me your flowcharts [code], and conceal your tables [schema], and I shall continue to be mystified; show me your tables [schema] and I won't usually need your flowcharts [code]: they'll be obvious.
-- Fred Brooks, "The Mythical Man-Month", ch 9.
I think he's saying that the overall high-level design (data-structures and their relationships) is much more important than the implementation details (code). I think he values programmers who can design a system over those who can only focus on details of a system.
Both are important, but I would agree that it's generally much better to get the big picture and have issues with the details than the other way around. This is closely related to what I was trying to express about breaking up big functions into little ones.
+1: I agree with you. Another aspect is that often programmers are more worried about what cool language feature they are going to use, instead of focusing on their data structures and algorithms and on how to write them down in a simple, clear way.
Well, I can't entirely agree, because you have to worry about all of it. And for that matter, one of the things I love about programming is the switches through different levels of abstraction and size that jump quickly from thinking about nanoseconds to thinking about months, and back again.
However, the higher things are more important.
If I've a flaw in a couple of lines of code that causes incorrect behaviour, it probably isn't too hard to fix. If it's causing it to under-perform, it probably doesn't even matter.
If I've a flaw in the choice of data structure in a sub-system, that causes incorrect behaviour, it's a much bigger problem and harder to fix. If it's causing it to under-perform, it could be quite serious or if bearable, still appreciably less good than a rival approach.
If I've a flaw in the relationship between the most important data structures in an application, that causes incorrect behaviour, I've a massive re-design in front of me. If it's causing it to under-perform, it might be so bad that it would almost be better if it was behaving wrong.
And it'll be what makes finding those lower-level problems difficult (fixing low-level bugs is normally easy, it's finding them that can be hard).
The low-level stuff is important, and its remaining importance is often seriously understated, but it does pale compared to the big stuff.
Knowing how the data will flow is all important. Knowing flow requires that you design good data structures.
If you go back twenty years, this was one of the big selling points for the object oriented approach using either Smalltalk, C++, or Java. The big pitch -- at least with C++ because that's what I learned first -- was: design the classes and the methods, and then everything else would fall into place.
Linus undoubtedly was talking in broader terms, but poorly designed data structures often require extra rework of code, which can also lead to other problems.
What can be applied/learned from it?
If I may, my experience in the last few weeks. The preceding discussions clarified the answer to my question: "what did I learn?"
I rewrote some code, and reflecting upon the results I kept seeing and saying "structure, structure..." as the reason there was such a dramatic difference. Now I see that it was the data structure that made all the difference. And I do mean all.
Upon testing my original delivery, the business analyst told me it was not working. We said "add 30 days" but what we meant was "add a month" (the day in the resulting date doesn't change). Add discrete years, months, days; not 540 days for 18 months for example.
The fix: in the data structure, replace a single integer with a class containing multiple integers; the change to its construction was limited to one method. Change the actual date arithmetic statements, all 2 of them.
- The new implementation had more functionality but the algorithm code was shorter and clearly simpler.
In fixing the code behavior/results:
- I changed data structure, not algorithm.
- NO control logic was touched anywhere in code.
- No API was changed.
- The data structure factory class did not change at all.
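A rough sketch of what such a change might look like, for readers who want something concrete. This is not the actual code from the project described above: the class name `Offset`, its fields, and the choice to clamp the day in short months are all assumptions made for illustration.

```python
import calendar
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Offset:
    """Hypothetical replacement for a single 'number of days' integer."""
    years: int = 0
    months: int = 0
    days: int = 0

    def add_to(self, start: date) -> date:
        # Shift by whole calendar months first, keeping the day of the month.
        month_index = start.year * 12 + (start.month - 1)
        month_index += self.years * 12 + self.months
        year, month = divmod(month_index, 12)
        month += 1
        # Assumption: clamp the day when the target month is shorter
        # (e.g. Jan 31 + 1 month lands on the last day of February).
        last_day = calendar.monthrange(year, month)[1]
        shifted = date(year, month, min(start.day, last_day))
        # Only then add the discrete days.
        return shifted + timedelta(days=self.days)


start = date(2024, 1, 15)
print(start + timedelta(days=30))       # the "add 30 days" reading: 2024-02-14
print(Offset(months=1).add_to(start))   # the "add a month" reading: 2024-02-15
print(Offset(months=18).add_to(start))  # 18 months, not 540 days: 2025-07-15
```

The callers still just ask for "start plus offset"; what changed is what an offset is. That matches the observation above that no control logic had to be touched.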