This continues our series of student reflections and analysis authored by our research team.
On the hardship of managing big data
Megan Roques
As the semester comes to an end, it is essential to sit back and review the work that has been done. One of the tasks the class has taken on is reviewing the validity of all of the project’s case files. The cases are separated by year, and each student signed up to review one year’s worth of cases.
Before starting this assignment, I believed that the time spent on this task could be better used adding more cases to the database or scraping documents to find more cases. However, I was wrong. I did not take into account the hardship of managing big data.
It would be impossible for one person to go through all the folders in a timely manner. Additionally, keeping the source files up to date is a task that should be done often. A document that was accessible three years ago may not be available now, for example because the company that hosted it has shut down. This is most likely to happen with sources from non-governmental entities, such as an article on a local newspaper’s website. Once a document goes missing, it should be a priority to replace it with another one that contains the same (or even more!) information.
Most of the document-replacing process was self-explanatory, such as checking whether the case was actually in our records or adding more documents if there were not enough. One of the steps, however, piqued my interest: make sure there is an official document. This step surprised me because I have previously coded a few cases where the only information available was from news sources or journal entries. The official information on a case may be difficult to obtain because the records may be sealed, the case may be a state case so the information is not easily available, or the case is very recent (within the past ~12 months) and processes like appeals or additional sentencing are still underway.
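The review steps described above can be sketched as a small checklist function. This is only an illustration: the document categories, the field names, and the `min_documents` threshold are all assumptions made for the example, not the project's actual guidelines or tooling.

```python
from dataclasses import dataclass, field

# Hypothetical labels for "official" documents; the project's real
# categories may differ.
OFFICIAL_KINDS = {"indictment", "charging_document", "court_record"}


@dataclass
class Document:
    title: str
    kind: str  # e.g. "indictment", "news_article", "journal_entry"


@dataclass
class CaseFolder:
    case_id: str
    documents: list = field(default_factory=list)


def audit_case(folder, known_case_ids, min_documents=2):
    """Return a list of problems found in one case folder.

    min_documents is an assumed threshold chosen for illustration.
    """
    problems = []
    # Step 1: is the case actually in our records?
    if folder.case_id not in known_case_ids:
        problems.append("case is not in the records")
    # Step 2: are there enough documents?
    if len(folder.documents) < min_documents:
        problems.append("not enough documents")
    # Step 3: is there at least one official document?
    if not any(doc.kind in OFFICIAL_KINDS for doc in folder.documents):
        problems.append("no official document")
    return problems


# A folder that holds only a single news article.
example = CaseFolder("case-001", [Document("Local newspaper story", "news_article")])
print(audit_case(example, known_case_ids={"case-001"}))
```

A folder that fails the last check is not necessarily wrong. As noted above, the records may be sealed or the case may be too recent, so a flag like this would only mark the case as one to revisit later.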
While going through my assigned folder, I was fascinated by how, despite the new information that gets released over time, our codes remain the same. Even in cases where official documents are later added to our files, such as the Randy Linn case, the case is still “correct” based on the guidelines set forth by the project. The details that do get updated are minor, such as a middle name or the case being added to a group case. You might be wondering why the members don’t wait for the official documents to be released in order to fully code a case. It may seem time-consuming and redundant to look for more files after a case is already in the database.
The answer lies in the way the information-gathering process is set up to deal with large amounts of data. This process is best described by scholars like Jensen (2018) as scraping information from other databases, governmental files (such as those of the FBI), and social media sites like Twitter. The scraped information then serves as our “supplementary data sources,” and the hunt for official documents begins. These supplementary data sources are key to coding a case because news sources or social media websites may give insight into the “why” of the offender, which is left out of indictments and charging documents. In that respect, the news articles and journal entries written about an offender can be more meaningful to the project than the official documents. However, gathering the official documents is still an important task, since they serve as confirmation of sentencing and charging details.
Throughout this activity, I have learned the value of different information sources and how each has its own impact on the project. It helped me realize that, often, not all the information you would like is available for a particular case. It is therefore important to keep monitoring certain cases so that when official documents or more details are released, we are able to retrieve them.
References:
Jensen, Michael J. “Big Data: Methods for Collection and Analysis.” In Theory and Methods in Political Science, edited by Vivien Lowndes, David Marsh, and Gerry Stoker, 4th ed., 306-20. London, UK: Palgrave Macmillan UK, 2018.