Data Scientists as Editors
I came across this interesting essay from Moritz Stefaner in which he outlines his perspective on the recent focus on “storytelling” within the data visualization community. I largely agree with his thesis, especially his point about viewing a visualization designer as an author, one who brings to bear editorial decisions about what data to use, where to focus, and what to exclude. There are degrees of editorialship, but even the most quantitative methods include some degree of human decision.
Stefaner writes:
Let’s make no mistake — even a very data-heavy, “sober” representation of data has an author who made clear decisions on what to include or not, what to combine, or not and what to prioritize. And the same holds for the underlying dataset. So, fully acknowledging the role of authorship, with all the journalistic responsibility it brings, is an important result of this line of thinking for data visualization.
Acknowledging that role is critical, but not just for visualization. I would more strongly emphasize what Stefaner mentions almost in passing: “And the same holds for the underlying dataset.”
I was making a similar point yesterday while debating over coffee with a colleague who works with machine learning and data mining. He resisted the idea of a human element in visual analysis, and I argued (he didn’t agree) that even his “non-visual methods” have a human in the loop. Who decides which questions to explore? Who decides which data to use to answer the questions? Who decides which features are used and which aren’t? Who decides which features to derive or construct? Who decides which labels are the important ones to predict, learn, or discover? How are those predicted, learned, or discovered insights used? I’m confident that not all of these steps are fully automated, entirely quantitative, and performed with no human intervention. If nothing else, the software they use is designed by human developers who make decisions about how the process should work. In every data-driven process that I’ve seen, there is a point (usually several points) where decisions are made by humans. No amount of computation following those points can remove the editorial impact of what those humans decide. There is nothing wrong with this state of affairs; rather, it is an essential element of most data science activities. However, we must be aware of this fact and remain open to the possibility that our discoveries have been influenced by the process by which we uncover them.