Hadley Wickham's Reading List
Hadley Wickham is Chief Scientist at RStudio, and an Adjunct Professor of Statistics at the University of Auckland, Stanford University, and Rice University. He builds tools that make data science easier and faster, including the famous tidyverse packages for the R programming language. He was named a Fellow by the American Statistical Association for "pivotal contributions to statistical practice through innovative and pioneering research in statistical graphics and computing".
Computer Science for Data Scientists (2018)
Scraped from fivebooks.com (2018-08-09).
Gerald Jay Sussman, Harold Abelson & Julie Sussman
"One interesting anecdote is that MIT no longer uses this book to teach its introduction to computer science . They’ve switched to Python instead of Scheme, which is the language taught in this book. The reasoning behind this is that the world doesn’t need more computer scientists; it needs some, but, by and large, what it needs is engineers who know how to use programming languages and achieve a goal, rather than thinking about the atomic constituents of computer science. But this book is very useful for somebody like me, with experience in high-level engineering languages, like VBA, PHP and R. They’re incredibly useful languages, but ones that computer scientists generally disdain, because they’re not theoretically pure or beautiful. This book shows you how languages can be constructed. The most valuable thing it gives you is confidence and knowledge to go and create your own programming language. You get a very good understanding of some of the trade-offs that you have to make when designing languages. For example R does a lot of things that are very unusual among programming languages, and some of them could be considered mistakes, but a lot of them exist because R is trying to achieve a particular objective, and was thus designed following specific and sensible constraints. Another similar and also interesting book is Concepts, Techniques, and Models of Computer Programming , which explains all the models of computer languages and how they fit together. But it’s even more complex than Structure and Interpretation of Computer Programs , so I’d stick with that choice for somebody getting started. I would not describe Scheme as a useful language. It comes back to the question of why you should use one programming language over another. You should not make that decision based on the technical merits of each language, but instead based on the community of people who use it and are trying to solve problems like yours. 
The community of people using Scheme today is small, and somewhat esoteric, but there are interesting ideas to be learned in the language anyway. And it was very influential in the design of R. R itself is a hybrid of S, a language designed in the 1970s from a pure statistics standpoint, and Scheme; so I learned it to satisfy my curiosity about why the creators of R thought that Scheme was so great. Finally, Scheme is a functional programming language, rather than an object-oriented one; and functional programming is currently experiencing a resurgence of interest."
Steven S. Skiena
"To me, this book is an illustration of the power of names. Today, in the era of Google, if you know the name of something, you can find out about it with a simple search. But if you don’t know of what you’re looking for, it suddenly becomes much harder to find it. Having in the back of your head the names of common algorithms that help you solve problems is really powerful. When you identify a new problem, it helps you to come up with ideas, for example to use breadth-first search , or a binary tree , etc. Yes, it’s good to acquire a sense of that. A lot of statistical theory is about measuring what happens to mathematical properties when some variable x goes to infinity, without thinking about what then happens to computational properties. But if your algorithm needs n² computations, it doesn’t matter if x goes to infinity, because you’ll never be able to compute that."
Andrew Hunt & David Thomas
"This is about the craft of software development, and thinking about how to produce good code. As the name suggests, it’s a very pragmatic and hands-on book. It really helped me on my journey as a software engineer, to be able to write quality code day in and day out, and be confident that it’s going to work correctly. It’s something that we never really talked about in my computer science education, and it’s certainly something that statisticians rarely think about. The goal is to turn an idea in your head into code that works, and that you can share with others. I think there are three main parts. First, for code to be good, it has to be correct and do what you think it does. Ideally, you want to verify that correctness somewhat formally, by writing unit tests . The idea of unit tests is the same as double-entry bookkeeping: if you record everything in two places, the chances of you making a mistake in both places on the same item are very low. So unit tests don’t guarantee that your code is correct, but they make it much more likely to be correct. But the requirements of your code will also change over time. So the second part, and maybe the bigger one, is to write code that will be correct in the future, i.e. easy to maintain. For example, it’s very important to write code that clearly communicates its intent; because you will come back to it six months later, having completely forgotten what you were trying to do. So the easier it is to read your code and understand what’s going on, the easier it will be to add new features in the future. “When writing code, you’re always collaborating with future-you; and past-you doesn’t respond to emails” The third part would be to make sure that it’s fast enough, so that it doesn’t become a bottleneck. It can be easy (and fun!) to get carried away with this, and obsess with writing code that’s exponentially faster. 
The important thing is to make sure that nothing is overly slowing down execution, to the point of interrupting the flow of your analysis, or meaning that your program has to run overnight. But it doesn’t matter how fast your code is, if it’s not correct and maintainable."
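The double-entry idea behind unit tests can be sketched with Python's built-in `unittest` module. The function `mean` and the test class `MeanTest` are hypothetical examples, not taken from the book: the intended behaviour is recorded once in the implementation and once, independently, in the expected values of the tests.

```python
import math
import unittest


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


class MeanTest(unittest.TestCase):
    def test_known_values(self):
        # Second "ledger entry": the expected result, worked out by hand.
        self.assertTrue(math.isclose(mean([1, 2, 3, 4]), 2.5))

    def test_empty_input_is_rejected(self):
        # Record the intended failure mode as well as the happy path.
        with self.assertRaises(ValueError):
            mean([])
```

Saved as a module, the tests can be run with `python -m unittest`. If the implementation and the tests ever disagree, at least one of the two entries is wrong, which is exactly the signal you want before sharing the code.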
Dustin Boswell & Trevor Foucher
"The problem with writing readable code isn’t to identify the problems; you can tell easily if your code is understandable or not. The challenge is to know how to make it better. The software development community often uses this idea that code ‘smells,’ to say that it’s badly written. What I liked about this book is that it gives you a series of techniques to make that smell go away."
Joseph Bizup & Joseph M. Williams
"Similarly, it’s easy to look at a sentence or a paragraph and to say that it doesn’t make sense or is badly written. This book gave me the tools to analyze a text and identify the reasons why it doesn’t work, for example stating the topic of a paragraph only in the middle of it. I found it very useful to consciously analyze my own writing. That’s important with programming, because obviously you’re communicating with a computer, but more importantly you’re communicating with other humans. And humans are much harder to work with, because you can’t write unit tests to check their understanding. And even if you’re the only programmer working on a particular project, you’re actually always collaborating with future-you; and past-you doesn’t respond to emails. “Writing well and describing things well is very valuable to a good programmer, and even more to a data scientist” Knowing how to write clearly helps you to write code clearly, and also helps you writing good documentation and explain the intent of what you’re doing. Even very good code will only ever tell you how something has been implemented; it won’t tell you why a particular technique has been chosen. Writing well and describing things well is very valuable to a good programmer, and even more to a data scientist. It doesn’t matter how wonderful your data analysis is, if you can’t explain to somebody else what you’ve done, why it makes sense, and what to take away from it. Partly because of another book that nearly made it onto my list: Domain-Specific Languages by Martin Fowler. It talks about the idea of writing a small language inside another language, to express ideas in a specific domain, and the idea of ‘fluent’ interfaces, that you can read and write as if they were human language. 
There have actually been attempts, for example by Apple, to write programming languages that were exactly like human language, which I think is a mistake because human language is terribly inefficient, and relies on things like tone and body language to clarify ambiguity. But thinking about how you can make a computer language as similar as possible to a human language is important. It can take simple forms, like thinking of functions as verbs, and objects as nouns, so you can draw on the grammatical intuition that comes from human language. Another thing I’ve been exploring lately is the question of foreign languages. The tidyverse gives you access to all of these verbs, but they’re all in English. Should we have translations of the tidyverse? Could we have a Spanish tidyverse, with Spanish equivalents of the verbs? Of course it raises many problems, the biggest one being that 75% of the resources available on sites like StackOverflow are in English, so the answers wouldn’t be universal anymore. But that’s an interesting area where we’re running small experiments; there’s a group of Spanish speakers working on a translation of the R for Data Science book, which includes translating some of the datasets that are used in it. I’m very interested to see where that goes, and how useful it can be to aspiring data scientists everywhere, especially when R is quickly democratizing access to the subject, well beyond the academic world."
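To make the idea of "functions as verbs, objects as nouns" concrete, here is a minimal sketch of a fluent interface in Python. The `Table` class, its verbs (`filter`, `arrange`, `select`) and the sample rows are all invented for illustration; this is not the tidyverse API, only an imitation of its reading style.

```python
class Table:
    """A tiny table (the noun): a list of row dictionaries."""

    def __init__(self, rows):
        self.rows = [dict(row) for row in rows]

    # Each verb returns a new Table, so calls can be chained and read in order.
    def filter(self, predicate):
        """Keep the rows for which predicate(row) is true."""
        return Table(row for row in self.rows if predicate(row))

    def arrange(self, column):
        """Sort the rows by one column, in ascending order."""
        return Table(sorted(self.rows, key=lambda row: row[column]))

    def select(self, *columns):
        """Keep only the named columns."""
        return Table({c: row[c] for c in columns} for row in self.rows)


# Made-up sample data.
people = Table([
    {"name": "Ana", "city": "Lima", "age": 41},
    {"name": "Ben", "city": "Oslo", "age": 29},
    {"name": "Eva", "city": "Lima", "age": 35},
])

# Reads almost like a sentence: filter, then arrange, then select.
result = (
    people
    .filter(lambda row: row["city"] == "Lima")
    .arrange("age")
    .select("name", "age")
)
print(result.rows)  # [{'name': 'Eva', 'age': 35}, {'name': 'Ana', 'age': 41}]
```

Because every verb takes a table and returns a table, the chain can be read top to bottom as a sequence of actions on one noun, which is where the grammatical intuition comes in.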