Getting Started with Text & Data Mining

Text & data mining (TDM) is quickly becoming a popular tool in STEM programs, but how do you get started? This blog post gives a general overview of TDM, links to open access and subscription data resources, tools to help you analyze and visualize data, and suggestions to help you get organized.

What is Text & Data Mining?

Text and data mining (TDM) is a broad term for the automated, computational process of gathering, selecting and analyzing large amounts of digital text or data, such as websites, newspaper articles and social media, in order to identify patterns and connections and to analyze language usage (Springer Nature, 2022).

Online publications have grown to massive proportions over the years, which makes them burdensome to analyze and make sense of using manual methods. TDM approaches can help scholars analyze patterns and information found in open and subscription-based resources.

Who Can Use Text & Data Mining?

While it might seem that TDM would be most popular in STEM subjects, it is also a useful tool for those examining topics in the arts, humanities and social sciences.

Open Data Sources

Data sources can be tricky to locate on the open web, but there are open options available to researchers wishing to use a well-established corpus of textual data. Below is a list of data sources that are fairly easy to start with when examining multidisciplinary research questions:

The HathiTrust Research Center (HTRC) provides a platform and training resources to support computational analysis of works in the HathiTrust Digital Library (HTDL) for educational purposes. Check with your institution’s library to find out if they are a member of HathiTrust.

Zenodo is a large, open, collaborative repository of datasets and code. Scholars often share datasets gathered from social media platforms here.

Documenting the Now Tweet Catalog is a collectively-curated catalog of Tweet datasets with particular focus on social justice movements.

Chronicling America (LoC) is a collection of digitized U.S. newspapers from 1777-1963. Text mining is supported via API access.

The New York Times offers access to a large portion of its content via an API. However, you will need to set up an account and have some Python scripting skills to effectively work with NYT data.
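To give a sense of the scripting involved, here is a minimal Python sketch of a request to the NYT Article Search API. It assumes the endpoint and response layout described on the NYT developer portal at the time of writing (a `response.docs` list whose items carry a `headline.main` field); please check the current documentation before relying on it. The function names are our own and purely illustrative, and you will need to supply the API key from your own account.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Article Search endpoint as listed on the NYT developer portal.
BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_search_url(query, api_key, page=0):
    """Build the request URL for a keyword search (illustrative helper)."""
    params = {"q": query, "page": page, "api-key": api_key}
    return f"{BASE_URL}?{urlencode(params)}"


def extract_headlines(payload):
    """Pull the main headline out of each article in a decoded JSON response.

    Assumes the layout response -> docs -> headline -> main; missing keys
    simply produce an empty list or empty strings.
    """
    docs = payload.get("response", {}).get("docs") or []
    return [doc.get("headline", {}).get("main", "") for doc in docs]


def search_headlines(query, api_key):
    """Send one search request and return the headlines on the first page."""
    with urlopen(build_search_url(query, api_key)) as resp:
        return extract_headlines(json.load(resp))
```

Calling `search_headlines("text mining", "YOUR_API_KEY")` would send a single request. For a real project you would also want to page through results and respect the rate limits attached to your account.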

Kaggle is a community of data science students and scholars and a repository of a wide variety of shared datasets.

Subscription-Based Data Sources

ProQuest’s TDM Studio is a great resource for researchers who wish to analyze large volumes of text from existing ProQuest content, such as current and historical newspapers, dissertations and theses, scholarly articles and primary sources across science, technology, medical, public policy and literature topics.

You can also incorporate your own datasets into TDM Studio. This may include data from open institutional repositories, social media and blogs.

TDM Studio offers flexibility for researchers who are new to or experienced with text and data mining projects by giving you options for data analysis and visualization. More experienced users may wish to use the Workbench Dashboard, where they can write code to organize, analyze and visualize data using open-source software such as Jupyter Notebook, RStudio and Python. Users new to text and data mining may prefer the Visualization Dashboard, which allows the researcher to find content and create visualizations without needing to know how to code.

Planning & Organizing Your Project

As with any other research endeavor, you will set yourself up for success by spending the beginning phase of your work planning your text and data mining project.

You may wish to map out your topic using a concept map so that you have a refined research question. Then, consider…

  • Is text and data mining appropriate for your project?
  • What specific issue do you believe can be solved or revealed by using text and data mining?
  • Consider what data sources you will need. Are they open access sources or available through a subscription?
  • What tools or platforms will you be using to collect, analyze and visualize your data? Excel, Jupyter Notebook, RStudio, HathiTrust or TDM Studio?
  • What type of analysis are you conducting? Are you analyzing words through text classification/natural language processing, conducting a sentiment analysis to better understand the positive/negative/neutral responses on a topic found in social media, or are you conducting a topic analysis by examining a subject or theme?
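To make the last question more concrete, here is a small Python sketch of two of the simplest analyses: counting the most frequent terms in a set of documents, and a word-list approach to sentiment. The word lists, stopwords and function names are hypothetical and deliberately tiny; a real project would use an established lexicon or a trained model, so treat this as an illustration of the idea rather than a recommended method.

```python
import re
from collections import Counter

# Hypothetical toy word lists, for illustration only.
POSITIVE = {"good", "great", "love", "helpful"}
NEGATIVE = {"bad", "poor", "hate", "confusing"}
STOPWORDS = {"the", "a", "an", "is", "and", "of", "to", "but", "this", "i"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_terms(docs, n=3):
    """Return the n most frequent non-stopword terms across all documents."""
    counts = Counter(
        tok for doc in docs for tok in tokenize(doc) if tok not in STOPWORDS
    )
    return counts.most_common(n)


def sentiment(text):
    """Label text by comparing counts of positive and negative words."""
    tokens = tokenize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, `sentiment("I love this helpful library")` is labeled positive because two words match the positive list and none match the negative list. Word counting of this kind is often a first step toward the topic and classification analyses mentioned above.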

References and Further Readings

Finn, J. (n.d.-a). Research Guides: Text & Data Mining: Getting started. Retrieved October 17, 2022.

Finn, J. (n.d.-b). Research Guides: Twitter Data: Working with Twitter Data. Retrieved October 17, 2022.

Fortino, A., Zhong, Q., Huang, W. C., & Lowrance, R. (2019). Application of Text Data Mining To STEM Curriculum Selection and Development. 2019 IEEE Integrated STEM Education Conference (ISEC), 354–361.

Kitchen, D. (2015). Project pipeline questionnaire. Humanities Institute.

Springer Nature. (2022). What is TDM? Text and Data Mining at Springer Nature.

Stanford University. (2022). Mining Culture Through Text Data. Introduction to Social Data Science.

Kimberly Chesebro, STEM Librarian, The Claremont Colleges Library

We welcome your comments and suggestions. If you have a resource that you would like to see highlighted, please leave us a comment.
