This continues our series of student reflections and analysis authored by our research team.
Source Scraping and Coding Confirmation
Emma Lovejoy
When we look for new cases to include in the tPP database, it’s always helpful to find an existing compilation we can pull names and data from. While cases in some existing databases automatically meet the criteria for inclusion (lists specifically dedicated to terrorist acts, political extremism, etc.), for other sources each case must be individually investigated and a determination made as to whether it belongs in tPP. For sources like this, it’s important that we take the time to ensure we’re not adding extraneous cases, and that the information being added is up to date.
When we locate a case that we think may need to be included, the first step is to check the defendant’s name against each of our active spreadsheets, to ensure the case really is new to tPP. Where we’ve already documented the case, this is an opportunity to see if there are updates to be made: whether variables we’d had trouble with in the past are clarified by new source material, whether the trial has progressed, and so on. If the case does not appear in our data yet, we move on to source collection.
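For readers curious what this duplicate check looks like in practice, here is a minimal sketch, assuming the active spreadsheets were exported as CSV files. The file names and the “defendant_name” column header are hypothetical, not tPP’s actual layout.

```python
import csv

# Hypothetical CSV exports of the active spreadsheets; the file names
# and the "defendant_name" column are assumptions for this sketch.
ACTIVE_SHEETS = ["working.csv", "completed.csv", "pending_review.csv"]

def find_existing_case(defendant_name: str) -> list[str]:
    """Return the sheets that already mention this defendant."""
    matches = []
    target = defendant_name.strip().lower()
    for sheet in ACTIVE_SHEETS:
        with open(sheet, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                if row.get("defendant_name", "").strip().lower() == target:
                    matches.append(sheet)
                    break  # one hit per sheet is enough
    return matches

hits = find_existing_case("Jane Doe")
if hits:
    print(f"Already documented in {hits}; check for updates instead.")
else:
    print("New to tPP; proceed to source collection.")
```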
An easy place to start when a case’s inclusion is still questionable is news stories. They’re usually easier to find than court documents, and they can give a general picture of the crime and the defendant. Based on what we see in the news, we can usually judge at least whether the case meets the criteria for inclusion, even if the details of the defendant’s ideology remain unclear. If it is a case for tPP, especially useful articles are saved as source files, and known information (name, dates, location, etc.) as well as the dataset it was originally pulled from is added to the working spreadsheet as a case-starter, to be coded.
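A case-starter might be modeled something like the record below. This is only an illustration: the field names are assumptions for the sketch and do not reflect tPP’s actual codebook variables.

```python
from dataclasses import dataclass, field

# Illustrative case-starter record; these field names are assumptions
# and do not reflect tPP's actual codebook.
@dataclass
class CaseStarter:
    defendant_name: str
    charge_date: str              # e.g. "2019-06-14"
    location: str                 # e.g. "Dayton, OH"
    origin_dataset: str           # the compilation the name was pulled from
    source_files: list = field(default_factory=list)  # saved news articles

starter = CaseStarter(
    defendant_name="Jane Doe",
    charge_date="2019-06-14",
    location="Dayton, OH",
    origin_dataset="example_compilation",
)
starter.source_files.append("doe_local_news_2019.pdf")
```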
If the case would have been included were it not for an exclusionary factor (charges not reaching felony level, death prior to charging, etc.), then the basic case-starter information is filed separately as an excluded case, with sources and an explanation of why. We do this to save ourselves time down the line, should the case come up again in the future. Given the overlapping content of various datasets, it’s not uncommon for an excluded case to raise flags on more than one occasion, so having this index improves our ability to work through external data efficiently.
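As a rough sketch of how such an index saves time, imagine the excluded cases stored in a simple JSON file keyed by normalized defendant name. The file name and structure here are assumptions, not tPP’s actual system.

```python
import json
from typing import Optional

# Hypothetical exclusion index: maps a normalized defendant name to the
# reason for exclusion and the sources consulted. The file name and
# structure are assumptions for this sketch.
EXCLUDED_INDEX = "excluded_cases.json"

def check_exclusion_index(defendant_name: str) -> Optional[dict]:
    """Return the exclusion record for this name, if one exists."""
    with open(EXCLUDED_INDEX, encoding="utf-8") as f:
        index = json.load(f)
    return index.get(defendant_name.strip().lower())

entry = check_exclusion_index("John Roe")
if entry:
    # The same name surfacing in a new dataset can be skipped immediately,
    # rather than re-investigated from scratch.
    print(f"Previously excluded: {entry['reason']} (sources: {entry['sources']})")
```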
Most recently, we have been working on developing a collective procedure for the scraping process. That is, a system in which each stage of examination is assigned to an individual or group, to expedite the scraping of each document. So far, this assembly-line approach has yielded hundreds of new cases to be incorporated.