The idea is to take a great gob of texts that talk about children and process them so as to extract the different ways they characterize children, and then to see whether there’s a way to organize those characterizations. Doing this is called defining the “latent semantic space” of children.
I was able to make a first, exploratory attempt at this last fall, looking at the version of Rousseau’s Emile available from Project Gutenberg. I explicitly considered this a prototype. Could I extract the kind of information I wanted from just one text, before trying to look at hundreds or thousands of texts? If you’re interested in what I found, here’s the full paper for your review. I welcome any comments!
From the point of view of sharing this with readers, there are two points I want to make.
On Methodology: The #1 charge leveled against any attempt to look at anything quantitatively or computationally is that researchers are trying to hide their biases or standpoints behind a false front of objectivity and value-neutrality, or otherwise find some false certainty, a charge that is often summarized in the word ‘positivism’. The charge isn’t always unfair. There are, in fact, researchers out there who are sufficiently naïve, mendacious or delusional to think that their results simply are ‘what the data says’ in some entirely unproblematic sense and who get very indignant if you suggest otherwise. But most … ok, many … researchers are aware that their results are, at best, what this data says, when asked this question, in this particular way. Indeed, they are painfully aware of all the many decisions they had to make to get to the point where they had the results they are presenting and worry that their results may be fragile to some factor that they have inadvertently or unknowingly failed to consider. They do the best they can to flag the decisions they made, to test the sensitivity of their results to those decisions, and to identify in advance all the ways they can think of in which their results could turn out to be misleading. But, ultimately, the point is to share what results you got, not because those results are the end of the story, but because the story is made up of all the results that get shared.
In this light, this paper is almost entirely devoted to documenting all the different points on which I could make decisions and what decisions I made. Only near the end does it present and interpret some results, based on those decisions. It doesn’t put a whole lot into the interpretation, because the main point of the exercise wasn’t to get to any definitive interpretation, but only to see if I could get something interpretable that actually addressed my question. Indeed, the main things I learned from the work reported in this paper were that:
- Yes, I could get some interpretable results that were at least pertinent to my question, if not really what I was looking for; and,
- There were definitely decisions that I made that I need to revisit, to get more appropriate results.
Certainty this is not!
What to do next: In the work reported in this paper, I applied some ‘topic modeling’ techniques to the Emile. The ‘topics’ I ended up with are actually quite a reasonable summary of what Rousseau discusses in the Emile, if self-evidently debatable. But my intention wasn’t so much to extract the latent semantic structure of the Emile itself, as to extract the latent semantic structure of how children are characterized in the Emile. The results I got were due to the single most important decision that I need to reconsider: I used the whole text of the Emile. Given that, it only follows that any structure I found would be structure that applied to the whole book. So, if I want just the structure of how it characterizes children, then I need to start by extracting the parts where Rousseau does characterize children. I don’t want the whole text. Figuring out how best to pull that information is definitely the next step and leads directly into the techniques of ‘opinion mining’ and ‘argument mining’ discussed early on in the paper. See? Learning.
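To make that next step concrete, here is a minimal sketch of the crudest possible version: keep only the paragraphs that mention children at all, and feed just those to whatever modeling comes afterward. This is a plain keyword filter, not opinion mining or argument mining. The term list, the function names, and the sample text are all invented for illustration (the sample sentences are not quotations from the Emile), and the choice of terms is itself one more decision that would need to be flagged and tested.

```python
import re

# Hypothetical seed vocabulary. Which words count as "talking about children"
# is an assumption, and exactly the kind of decision that should be documented.
CHILD_TERMS = (
    "child", "children", "infant", "infants",
    "boy", "boys", "girl", "girls", "pupil", "pupils",
)

# Match any seed term as a whole word, ignoring case.
_TERM_RE = re.compile(r"\b(?:" + "|".join(CHILD_TERMS) + r")\b", re.IGNORECASE)


def split_paragraphs(text):
    """Split plain text on blank lines and collapse whitespace inside each paragraph."""
    parts = re.split(r"\n\s*\n", text)
    return [" ".join(p.split()) for p in parts if p.strip()]


def child_passages(text):
    """Return only the paragraphs that contain at least one seed term."""
    return [p for p in split_paragraphs(text) if _TERM_RE.search(p)]


# Invented sample text, standing in for a plain-text edition of a book.
sample = """The child learns by touching things.

The laws of the city are old.

A tutor should follow
his pupil's curiosity."""

for passage in child_passages(sample):
    print(passage)
```

A filter like this will miss passages that characterize children without naming them (pronouns, for instance) and will keep passages that mention them only in passing, which is part of why the more careful techniques are worth looking into.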