Figuring out how to do what the project aims to do is why I’m a student again. I know how to analyze texts the old-fashioned way – reading them carefully, trying to make sense of various elements of them and the connections between those elements – and I know how to analyze ‘structured data’ (variables in columns, cases in rows) statistically. But learning how to analyze texts statistically – well, that’s new for me.
The biggest practical challenge is thinking through what one would do if one were going to do the analysis ‘the old-fashioned way’. It’s easier to just sit down and do it than it is to say exactly what you’re doing – or, even more so, what you would do in a hypothetical analysis.
A standard, basic text analysis begins with turning a text into a ‘bag of words’ – putting every word in your corpus into a column, with every document a row, and some measure (presence/absence, count, weighted count) of that word in each document as the values. One isn’t restricted to including just individual words, nor is ‘individual word’ anywhere near as simple a thing as it may seem. But that’s the most basic structure.
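To make that structure concrete, here is a minimal sketch using only the Python standard library. The `tokenize` and `bag_of_words` helpers and the two sample documents are made up for illustration, and the tokenizer is deliberately crude (lowercase runs of letters), which is exactly the kind of choice that makes ‘individual word’ less simple than it seems.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters/apostrophes (a crude choice)."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, matrix): one row per document, one column per word."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    # Each cell holds the raw count of that word in that document.
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical two-document corpus.
docs = ["The child plays.", "The child reads; the teacher reads."]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)  # ['child', 'plays', 'reads', 'teacher', 'the']
print(matrix)      # [[1, 1, 0, 0, 1], [1, 0, 2, 1, 2]]
```

Swapping the raw count for presence/absence or a weighted count only changes what goes into each cell; the rows-by-columns shape stays the same.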
A common way to then find out what patterns are in the data is ‘topic modelling’: generating a model of the ‘topics’ discussed in the corpus by determining which words are consistently found together in its texts. For the kind of thing this project aims to do, that sounds like an excellent place to start. It’s a common method for a reason.
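For a sense of what ‘determining which words are found together’ can look like in practice, here is a small sketch of one widely used topic model, latent Dirichlet allocation, fitted by collapsed Gibbs sampling. This is an assumption for illustration: nothing above says which topic model the project uses. The function name `fit_lda`, its parameters and the toy documents are all hypothetical, and a real analysis would more likely rely on an established library than on hand-rolled code.

```python
import random
from collections import Counter


def fit_lda(docs, n_topics=2, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on already-tokenized documents.

    Returns (topic_word, doc_topic): word counts per topic and
    topic counts per document after the final sweep.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    topic_word = [Counter() for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics                       # tokens per topic
    doc_topic = [[0] * n_topics for _ in docs]         # topic counts per document
    assignments = []

    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            topic_word[z][w] += 1
            topic_total[z] += 1
            doc_topic[d][z] += 1
        assignments.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove this token's current assignment ...
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                doc_topic[d][z] -= 1
                # ... and resample it given every other assignment.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                topic_word[z][w] += 1
                topic_total[z] += 1
                doc_topic[d][z] += 1
    return topic_word, doc_topic


# Hypothetical toy corpus, already tokenized.
toy_docs = [
    ["school", "teacher", "lesson", "school"],
    ["toy", "game", "play", "toy"],
    ["school", "lesson", "play"],
]
topic_word, doc_topic = fit_lda(toy_docs, n_topics=2, n_iter=50)
for k, counts in enumerate(topic_word):
    print(k, [w for w, n in counts.most_common(3) if n > 0])
```

Words that keep appearing in the same documents tend to end up assigned to the same topic, which is the sense in which the model groups words that are ‘found together’. The results vary with the seed and the number of topics, so any output from a corpus this small should be read as illustration only.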
Trying topic modelling out in a ‘pilot test’ produced some interesting results, but showed that it wouldn’t be sufficient on its own. The texts at hand discuss more topics than the ones we’re interested in. So, a key challenge to be met is to figure out how to extract the sentences that talk about, or meaningfully characterize, children and childhood. We can then try modelling the topics in those extracts, rather than all the topics in the corpus.
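The crudest possible version of that extraction step is a keyword filter, sketched below as a baseline. The seed list `CHILD_TERMS`, the helper names and the sample text are hypothetical, and the sentence splitter is naive. A filter like this will miss sentences that characterize children without naming them (through a pronoun, say), which is part of why richer approaches are worth considering.

```python
import re

# Hypothetical seed list; a real one would need far more care.
CHILD_TERMS = {"child", "children", "childhood", "infant", "infants", "youth"}


def split_sentences(text):
    """Naively split on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentences_about_children(text, terms=CHILD_TERMS):
    """Keep only the sentences containing at least one seed term."""
    hits = []
    for sentence in split_sentences(text):
        words = set(re.findall(r"[a-z]+", sentence.lower()))
        if words & terms:
            hits.append(sentence)
    return hits


sample = "Children need play. Taxes rose last year. Childhood is short!"
print(sentences_about_children(sample))
# ['Children need play.', 'Childhood is short!']
```

The extracted sentences, rather than the full documents, would then become the input to the topic model.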
Three areas of Natural Language Processing suggest themselves as useful guides to doing this:
- opinion mining – figuring out from a passage who holds what opinion about which aspects of what phenomenon, preferably why, and (where relevant) when and where they do so
- argument mining – figuring out from a passage what claim is being made, what reasons or evidence are being offered in support of that claim, and what reasons or evidence are offered as attacks upon it; all of these steps depend on being able to recognize when one text entails another (if one, then the other) and when one text contradicts another (if one, then not the other)
- knowledge graphs – figuring out what relations hold between words or phrases in a language, both in general (what kinds of relations there are) and in specific cases (which relations hold between which specific words or phrases)
But figuring out which of these works best for the purposes of this project remains on the ‘to-do’ list.
Now, you might ask, if this is so complicated and requires learning all this new stuff, why not do it the old-fashioned way? Forget about the statistics and the computers and just get reading? Well:
- There’s so much to read. There are thousands of primary sources from the past couple of millennia to read through and many thousands of works discussing them. There are over 1,000,000 academic articles on children published just in the last 10 years, and that doesn’t include anything published in newspapers or non-academic magazines or written online. Reading even a tiny fraction of that would take years, with more time on top to analyze it properly. The alternatives are hiring an army of readers (beyond my pay grade) or using a computer.
- Not everything is written in languages I can read. People have written on children and childhood in every written language and, unfortunately, I can’t read every written language (Universe? Care to change that?). Nor can I rely on everything having been translated into English, or any other language I can manage. Computers don’t have that problem. There’s still a big issue about how to make text in different languages ‘commensurable’ with one another – how to make them all work in a single analysis – but getting the text into the analysis in the first place isn’t the challenge.
- The firehose of new writing is still pumping. There is more material pertinent to a project like this published every day than anyone could read in a year. If I can build the kind of model of discourse about children and childhood that is envisioned in The Goal, then a computer could keep processing new texts and automatically ‘updating’ the model, to see if something new emerges or to monitor how the long-established patterns are evolving. And only something automated could hope to come close to staying on top of everything coming out.
- This is how you make the evidence for your conclusions explicit, so other people can replicate them, criticize them, or try different things. In a statistical model, the variables, the weights on those variables, and the relations between the variables are all laid out. That’s what the model is. It’s not like one couldn’t do that all by hand, actually annotating one’s texts in detail and making all those steps in one’s reasoning public – but that’s not what is typically done. The usual case is to marshal illustrative evidence for one’s thesis, as part of a narrative. There’s nothing wrong with that. It certainly doesn’t make one’s conclusions false. But, in a case like this, where a big part of the point is to get away from impressionistic overviews (like the examples shown in The Goal) and try to be more comprehensive, I want to make the evidence for my reading more explicit.