Natural Language Processing of Survey Data for TSLAC

>Summer 2023
>Texas State Library and Archives Commission
>Team: Self
>Tools: Python, Pandas, Gensim, NLTK, VSCode, Tableau, Canva


TSLAC’s CEC team generates more survey data than can be parsed by hand, but too little to support fully automated NLP analysis. The project therefore focused on basic topic modeling and on recommendations for future survey structure.


Background
The Continuing Education and Consulting (CEC) team at the Texas State Library and Archives Commission (TSLAC) provides online workshops and resources for librarians across Texas and the US. The CEC team had a backlog of exit survey data from these workshops and no way to analyze it. The aim of this project was to help the CEC team analyze the data they had already collected, using textual analysis and NLP in Python, and to provide recommendations for future data collection and survey structure.

Process
To start, I converted the Excel files of survey data to .csv files so they could be read with the pandas module in Python. Then, using the Natural Language Toolkit (NLTK) and Gensim, I pre-processed the text and built a Gensim corpora dictionary of the vocabulary used in the survey responses. From there, I used Gensim's LDA (Latent Dirichlet Allocation) model to generate topics and NLTK's word frequency distribution to find the most common words. These results were exported to multiple .csv files and used to build the final dashboards.
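
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the sample responses are made up, the stop list is a tiny stand-in for NLTK's stopwords corpus, `collections.Counter` stands in for NLTK's `FreqDist` (which is a `Counter` subclass), and the function names and LDA parameters are hypothetical. The real pipeline read its text from .csv files with pandas rather than from an in-memory list.

```python
import re
from collections import Counter

# Tiny illustrative stop list (assumption); NLTK's stopwords corpus is fuller.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "was", "is",
              "i", "for", "in", "on", "it", "more"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def word_frequencies(docs):
    """Count tokens across all documents (stand-in for nltk.FreqDist)."""
    return Counter(token for doc in docs for token in doc)


def fit_topics(docs, num_topics=2, passes=10, seed=0):
    """Fit an LDA model with Gensim and return its topics.

    Gensim is imported here so the rest of the sketch runs without it.
    The parameter values are illustrative, not the project's settings.
    """
    from gensim import corpora
    from gensim.models import LdaModel

    dictionary = corpora.Dictionary(docs)  # maps each token to an integer id
    bow_corpus = [dictionary.doc2bow(doc) for doc in docs]
    lda = LdaModel(corpus=bow_corpus, id2word=dictionary,
                   num_topics=num_topics, passes=passes, random_state=seed)
    return lda.print_topics(num_words=5)


# Made-up responses for illustration only, not TSLAC survey data.
responses = [
    "The workshop on cataloging was helpful and the presenter was clear.",
    "More time for questions on cataloging tools.",
    "I enjoyed the presenter and the grant writing examples.",
]

docs = [preprocess(r) for r in responses]
freq = word_frequencies(docs)
top_words = freq.most_common(2)
```

With Gensim installed, `fit_topics(docs)` returns a list of `(topic_id, description)` pairs. On a corpus this small the topics are not meaningful, which mirrors the project's finding that the survey data was too limited for NLP to work well on its own.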

Outcomes
The first dashboard shows the CEC team where their attendees are located and how survey response rates compare. The second dashboard displays the topic modeling results along with noun and adjective frequencies for the words in the corpus. Based on the data analysis and visualization, I made several recommendations to TSLAC's CEC team regarding future data collection. I advised the team to prioritize structured question types such as Likert scales and multi-select options, so they can collect more recommendations from attendees without relying on free-response questions. Together, the dashboards and the survey recommendations should let the CEC team analyze their exit survey data more quickly, while also drawing more responses from attendees who are unlikely to answer a free-response question.