Text Analysis Praxis Assignment

I chose to experiment with Voyant and word tree for the text analysis praxis. I wanted to find a text that I was familiar with so that I could at least kind of understand what I was looking at. I went to Project Gutenberg to see if I could find a text that was available that I was familiar with. I was excited to find The Well of Loneliness by Radclyffe Hall. Project Gutenberg is an awesome resource that I enjoyed exploring and hope to continue to use in the future.

The first tool I used was Voyant. I liked the various charts that this tool produced and the ease of viewing/playing with the different visualizations.

Voyant – The Well of Loneliness

I didn’t think it was all that illuminating to see that “like”, “little”, and “said” were among the most used words. I was also unsure of how to read or manipulate the data in any meaningful way based on the initial output. But once I played around a little bit, things got more interesting. I decided that I would look at the main character’s love interests throughout the novel and compare that to instances of the word “longing”. I wanted to use the words “longing”, “longed”, and “lonely”, but I could not figure out how to make them all into one category. I think if I had played around with the tool more I would have been able to figure that out. Based on the data, “longing” was the word in this cluster used the most throughout the book, so I chose to use that word for my analysis.
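One way to treat several word forms as a single category, outside of Voyant, is to count them together in a short script. The following is a minimal Python sketch, not a description of how Voyant works; the function names, the segment count, and the sample sentence are all hypothetical.

```python
import re

def tokenize(text):
    # Lowercase the text and keep runs of letters and apostrophes as tokens.
    return re.findall(r"[a-z']+", text.lower())

def grouped_counts_by_segment(text, group, segments=10):
    # Split the token list into roughly equal segments and count, in each
    # segment, how many tokens belong to the given group of word forms.
    tokens = tokenize(text)
    size = max(1, -(-len(tokens) // segments))  # ceiling division
    counts = []
    for start in range(0, len(tokens), size):
        chunk = tokens[start:start + size]
        counts.append(sum(1 for t in chunk if t in group))
    return counts

# Hypothetical sample text standing in for the novel.
sample = "She longed for home. The longing grew. He was lonely and longing."
longing_group = {"longing", "longed", "lonely"}
print(grouped_counts_by_segment(sample, longing_group, segments=2))
```

Plotting the returned list would give one combined trend line for the whole cluster of words rather than a separate line for each.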

Briefly, the novel follows Stephen, a lesbian in early 20th century England. Collins is her tutor (childhood), Angela is a friend she has a relationship with (adolescence/early 20s), and Mary is arguably the love of her life (adulthood). I think it’s interesting that Stephen’s feelings of longing are heightened when she is in a relationship. In the novel Stephen is obviously queer: she wears “men’s” clothes, doesn’t marry, does traditionally masculine activities, etc. Whether it’s a symptom of the time or genuine attraction, Stephen dates feminine women who are often betrothed to men (Angela) or who face discrimination/a harder life, which prompts Stephen to push them away and into the arms of a man (Mary). I think you could extrapolate that these factors influence her feelings of longing.

The other tool I explored for this praxis was word tree. I really liked the interface and how the user interacted with the text. It was useful to have the full quote highlighted on the side of the page, which I think would be really useful for performing close readings of texts. This tool also seemed to capture the overall themes of the novel better than the Voyant analysis. As a fun little treat the results also read like poetry to me.

word tree – The Well of Loneliness

PRAXIS: Text Analysis ACT UP

I used Voyant to analyze the following online texts/sites:

The New York Public Library Digital Collections: ACT UP New York Records

National AIDS Memorial: History of the Quilt

New York City AIDS Memorial: HIV/AIDS Timeline

Britannica: ACT UP article

Voyant made it very easy to create visualizations such as Cirrus, TermsBerry, word links, and word trends. It was fun to see similarities between texts and sites very quickly. I thought certain words regarding race and sexuality, and more healthcare terms like illness, drugs, and certain medical professions, would be more common, but that wasn’t the case. I also expected different entities, like the FDA, government bodies, and maybe other activist groups, to come up. Furthermore, it was interesting not to see the names of people who were pivotal to the movement.

I was trying to see how I could link terms together, for example treating “new” and “york” as a phrase instead of as individual words. In the TermsBerry, Cirrus, and Trends views you see the words separately.
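Counting two-word phrases instead of single words is usually called bigram counting. Here is a minimal Python sketch of the idea, independent of Voyant; the function name and the sample sentence are hypothetical, and the sketch ignores sentence boundaries for simplicity.

```python
import re
from collections import Counter

def bigram_counts(text):
    # Lowercase, split into alphabetic tokens, then count adjacent pairs.
    # Note: pairs are also formed across sentence boundaries in this sketch.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(zip(tokens, tokens[1:]))

# Hypothetical sample text.
sample = "New York is big. ACT UP New York met in New York."
counts = bigram_counts(sample)
print(counts[("new", "york")])
```

With counts like these, “new york” can be reported as one term rather than as two separate words.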

In all, I enjoyed using this tool: it was very easy to add more sites and to hover over the visualizations to see the information. I would say the tool is very useful, insightful, and efficient in producing the data.

While continuing to play with this tool, I found the ScatterPlot feature, where you can select among different unsupervised learning methods such as document similarity, principal component analysis (PCA), correspondence analysis, and t-SNE. In the graph below you can see how close together the points for “new” and “york” are, which goes back to my point about using the phrase “new york” instead of individual words.

Praxis Post – Using a Digital Activist Corpus in Voyant

I gained experience with text mining in a Corpus Linguistics course; we primarily used AntConc in that class. In my final project, I created my own corpus and used AntConc to analyze the context in which journalism/media outlets write about digital activism. The corpus totals 108 articles and 148,821 tokens. There are 18 articles from The New York Times, BBC, 2 from The Washington Post, 4 from The Wall Street Journal, 14 from CNN, 1 from the Financial Times, 18 from Al Jazeera, 4 from PBS, 1 from Foreign Policy, 5 from NPR, 37 from The Guardian, 2 from Reuters, and 2 from CNBC. For the in-class exercise, I used Voyant to analyze the same corpus.

With Voyant, I enjoyed the summary the software provided of the documents: document length, vocabulary density, and average words per sentence. Additionally, Voyant has more visual features, which make the scale of the words more understandable (whether a word appears in larger quantity or a phrase structure has been repeated).

Connecting GitHub was convenient and something that AntConc could not do; however, I couldn’t connect or compare my corpus to a reference corpus (which AntConc can do). I could compare the frequency of words found in the texts to one another (e.g., “digital activism” vs. “internet”). For basic text mining by someone with less experience, Voyant is easy, convenient, and powerful.
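Comparing terms across documents of different lengths is easier with relative frequency, meaning hits divided by the total number of tokens. The sketch below is one simple way to compute it in Python for a single word or a multi-word phrase; it is an illustration under that definition, not Voyant’s or AntConc’s own calculation, and the function name and sample text are hypothetical.

```python
import re

def relative_frequency(text, phrase):
    # Share of token positions at which the word or phrase begins.
    tokens = re.findall(r"[a-z]+", text.lower())
    target = phrase.lower().split()
    n = len(target)
    if not tokens or n == 0:
        return 0.0
    hits = sum(
        1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target
    )
    return hits / len(tokens)

# Hypothetical sample document.
doc = "Digital activism spreads on the internet. The internet helps digital activism."
print(relative_frequency(doc, "internet"))
print(relative_frequency(doc, "digital activism"))
```

Running this on each article in a corpus would give per-document values that can be compared directly, regardless of article length.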

The corpus summary.

The word cloud created by Voyant.

The collocates of the term “digital activism”.

The relative frequency of “internet” vs “digital activism” within the documents of the corpus.

Blogpost: (PRAXIS) Text Analysis of the US Constitution using Voyant for the first time.

My Experience with Voyant.

Before resolving to use Voyant, I initially explored Google N-gram but found it “kinda” difficult to navigate for deeper insights. Voyant, on the other hand, felt much more user-friendly, especially with its collection of very helpful features. The Cirrus tool, which creates a word cloud, stood out immediately. It highlights the most frequent words in a corpus, offering a quick, visual snapshot of key terms. Another useful feature was Terms, which displays the frequency of terms across the document, making it easy to track word usage patterns.

Links, a network diagram tool, was particularly helpful for exploring how words co-occur, offering insight into relationships between key concepts. The Reader view displayed the full text, allowing me to highlight and analyze terms within the document directly. Additionally, TermsBerry, a playful bubble chart, allowed me to visualize word frequency and connections in an engaging manner.

Other features, such as Trends, as well as Context and Bubblelines, added even more depth to the analysis. Voyant also provides statistics such as word counts, vocabulary density, and readability scores, making it not only visually engaging but also a quantitative tool for text analysis. Its ability to generate instant visual feedback and downloadable outputs made it ideal for my praxis.

Analyzing the U.S. Constitution

First, as part of the mining, I searched on Google for a .txt file of the US Constitution and was able to find THE CONSTITUTION OF THE UNITED STATES OF AMERICA As Amended on www.govinfo.gov, which I selected in full, copied, and pasted into the Voyant reader for analysis.

Using Voyant to analyze the U.S. Constitution was an interesting experience. The corpus was a single document, containing 39,243 words and 1,896 unique word forms. Voyant’s summary statistics revealed key insights, such as a vocabulary density of 0.048, indicating high repetition in language, and a readability index of 10.001, suggesting that the text is accessible to a broad audience.
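Vocabulary density is commonly calculated as the number of unique word forms divided by the total number of words. Assuming that definition, the short Python sketch below reproduces the reported figure from the two counts given above; the function name is hypothetical.

```python
def vocabulary_density(unique_forms, total_words):
    # Ratio of distinct word forms to total words; lower means more repetition.
    return unique_forms / total_words

# Counts reported in the Voyant summary for this document.
density = vocabulary_density(1896, 39243)
print(round(density, 3))
```

The result rounds to 0.048, which matches the summary statistic and confirms that fewer than 5 in 100 words are new word forms.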

From the Cirrus tool, it was revealed that the most frequent words in the text were terms like “shall” (1,268 occurrences), “states” (592), “congress” (396), “state” (387), and “president” (370). These terms reflect the U.S. Constitution’s focus on governance, authority, and the distribution of power.
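A word cloud like this rests on a simple frequency count with common function words filtered out. The following Python sketch shows that idea under stated assumptions: the stopword list is a small hypothetical one (not Voyant’s), and the sample text is invented for illustration rather than quoted from the Constitution.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; real tools use much longer ones.
STOPWORDS = {"the", "of", "and", "to", "in", "a", "be", "or", "by", "for"}

def top_terms(text, n=5, stopwords=STOPWORDS):
    # Count lowercase alphabetic tokens that are not stopwords.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stopwords).most_common(n)

# Invented sample text.
sample = (
    "The Congress shall have power. The President shall be commander. "
    "Congress shall make no law."
)
print(top_terms(sample, n=2))
```

Because “shall” is not in the stopword list, it rises to the top here just as it does in the Cirrus output.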

The Links tool allowed me to explore how these terms are connected. For example, it was interesting to see how frequently “states” and “congress” appeared together, highlighting their relationship in the text.
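Co-occurrence of this kind is often measured by counting which words fall within a few words of a target term. Below is a minimal Python sketch of window-based collocate counting; it is an assumption about the general technique, not a description of how the Links tool is implemented, and the function name, window size, and sample text are hypothetical.

```python
import re
from collections import Counter

def collocates(text, target, window=5):
    # For every occurrence of the target word, count the words that appear
    # up to `window` positions before or after it.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            start = max(0, i - window)
            counts.update(tokens[start:i] + tokens[i + 1:i + 1 + window])
    return counts

# Invented sample text.
sample = "the states and congress shall meet the congress of the states"
near_states = collocates(sample, "states", window=2)
print(near_states.most_common(3))
```

A larger window captures looser associations, while a smaller one stays closer to words that appear in the same phrase.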

The Reader view allowed me to read the full document while tracking specific words, and TermsBerry provided an interactive visualization of word frequency, which made it easy to explore patterns and relationships between terms.

The Trends tool, which combines line and bar charts, showed how term frequency changed over the course of the document.

In summary, Voyant offered a visually engaging, data-oriented approach to analyzing the U.S. Constitution, making the analysis colorful, accessible, and insightful. As a prospective Digital Humanist, I will very likely be using it much more in the future.

Kelechi Iwuagwu – (A Data Analysis & Viz Candidate, CUNY Grad Center)

Text-Mining Project: Bell-Bottoms

For this assignment, I went to Google’s NGram Viewer and, first, thought about the data shown by the example graph. This was a comparison of the frequency of appearances of the words “Frankenstein,” “Albert Einstein,” and “Sherlock Holmes” from 1800 through 2020. As there were three lines charted on the same graph, it was easy to compare how mentions of each name rose, fell and intersected. An interactive vertical line feature showed the names’ changing positions along the X-axis. It was interesting and surprising to see how the frequency of each name’s mentions in publications varied over more than two centuries, and I tried to think of cultural and news events that may have coincided with the increases (and decreases). For example, I wondered if the name “Frankenstein” appeared more in certain years because of popular new movies based on or inspired by Mary Shelley’s 1818 story (1931, 1974, 1994 and 2015…the latter with Daniel Radcliffe) or based on news stories relating to genetically modified food (1990s).

Next, I started my own search and entered the term “bell bottoms,” not really knowing what to expect beyond a surge of mentions of the pants style during the 1960s, which indeed occurred. Prior to that, the most mentions of bell-bottoms occurred during the 1940s. The Google NGram’s book-search feature, which I hadn’t used while exploring the previous graph, was super helpful in providing context for earlier highs and lows.  I learned that, prior to the 1960s, the most popular reference to bell-bottoms was for a style of pants developed for sailors; the wide bottoms meant the sailors could easily roll them up above the knee to prevent them from getting wet. Earlier in the century and further back into the mid- to late 19th century, the name “bell bottoms” appeared less often, but usually referred to a dandyish style of men’s pants. F. Scott Fitzgerald wrote about them in a fashion article (co-written with his wife, Zelda Fitzgerald) for McCall’s magazine in 1925. (I also learned that, mostly prior to 1960, “bell bottom” occasionally described chemical-processing equipment or the shape of certain tree roots.)  

In 1961, Life magazine reported that London designer Mary Quant had created “a line of low-waisted slacks” which, “in British nautical tradition,” featured bell-bottoms; this signaled movement away from the style’s more practical use, and the frequency of mentions quickly rose toward the term’s first significant high point, in 1971. Mentions declined sharply after this, hitting a relative low in 1982. The term began to appear more often beginning in the early 1990s, sometimes describing Halloween costumes, sometimes marking bad fashion choices (in literature and in magazines).  As the decade passed, the term’s appearance began to rise again, when it seems to have become less a signal for a fashion misfire than a symbol for the counterculture or the disco years. Mentions of “bell bottoms” continued to increase, reaching another high point in 1998, when it began to be associated with “vintage” looks worn by young people, funk musicians and skaters. “Bell bottoms” again peaked in frequency, in 2011, and most of these mentions seem to come from literature, fashion history texts, or appear in relation to decade-themed events. 

This exercise was a great way to try out text mining. As with any tool of this kind, of course, human interpretation of the results is needed to provide the best context, although reading the original documents found by the search feature can be exciting.  

Text Analysis Praxis

I focused on Voyant Tools. Initially, I uploaded two different young adult books separately as the texts to analyze, and I noticed that most of the frequent words in both were common words like “said”, “just”, and “like”, or a character’s name. I also noticed that some terms were highlighted in either green or red, but when I hovered over them the tool didn’t tell me what the colors meant; I assumed they marked positive or negative terms. For example, “dead” or similar iterations of the word, “monster”, “darkness” and “stupid” were highlighted in red. Other terms like “nice”, “hero”, “love” and “magic” were highlighted in green.

When doing this assignment, I was curious about how often certain phrases or words were used because of running jokes between each book’s fandom. For Percy Jackson and the Lightning Thief, if I select the top phrases for Annabeth Chase, one of the main characters, I get “Annabeth yelled” which could be because her life and the lives of her friends are usually in danger.

Secondly, I used an article titled Race and Gender by Gebru. The most frequently used words were more varied than in the books, which made sense since the books are narrated in the first person. Gender was used 53 times, and AI was used 50 times. Some of the terms highlighted in red were “issues,” “inequality,” “harassment,” “errors,” and “racist.” Terms like “powerful,” “ethical,” “success,” and “fair” were highlighted green.

It was also interesting to see the results of comparing two documents. Distinctive words between the two get added to the summary section, and the trends graph compares how frequently specific terms appear in each document. For example, “archives” and “archival” appear more frequently in “Toward Slow Archives” than in Johnson’s “Markup Bodies”. On the other hand, in “Markup Bodies” terms like “black” and “digital” had a higher frequency than in “Toward Slow Archives”.

In general, I think the text analysis provides more informative analyses for academic articles/texts. I did like the overall interface of the tool. There were a lot of features to explore; there was a summary section that included vocab density, readability index, average words per sentence, and most frequent words in the document. Another section called “Reader” featured the document that could be read in real-time and yet another section included a graph that displayed word frequency trends at document segments.

Text Analysis Praxis

For the text analysis praxis assignment, I mostly explored Voyant and the JSTOR text analyzer, Constellate. Voyant was a very self-explanatory and easy-to-use platform. You can either upload your text or paste it directly in the text box. Then, the tool creates multiple visualizations based on that text. My first step was to choose which text to dive into. I knew I wanted it to be a simple text. Additionally, I wanted to explore something personal to me, that might reveal something pertinent to my life. To this end, I copied and pasted my two personal statements from my grad school applications into Voyant. The statistics revealed that the words used most frequently are data, children, education, and trauma. This is very much in line with my purpose going into grad school, as I am a QMSS student focusing on trauma in education. My favorite visualization was the one where my frequently used words were connected to one another. This allowed me to see not only the main themes, but how they came together to make one cohesive argument. I also noticed Voyant’s use of the word ‘Reveal’ as the button to generate the statistics. This implies there is a truth hidden within the text that must be unveiled.

Using Constellate, I decided to explore published texts that contain the word ‘Gaelic’ to stay with the theme of my previous mapping project. The visualization generated was a graph that showed the number of texts with that keyword published over time. Furthermore, I could explore word frequencies within that sample of texts. I had a lot of fun with this feature, looking at what concepts were more and less prevalent at certain times in history. For example, both the percentage and total number of texts published about Gaelic in Ireland were much larger than those published about Gaelic in Scotland. During this exploration, I found myself thinking of our readings that insist that data is never neutral and requires close reading and interpretation. For example, if you switch the summary metric from percentage to total count, an entirely different story is told. This not only requires transparency from the statistician, but attention and prior knowledge from the audience.
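The difference between percentage and total count can be shown with a small worked example in Python. All numbers below are hypothetical and chosen only to illustrate how the two metrics can point in opposite directions; they are not taken from Constellate.

```python
def share_percent(matching, total):
    # Percentage of all texts in a period that contain the keyword.
    return 100.0 * matching / total

# Hypothetical data: (texts containing the keyword, all texts published).
periods = {"period A": (50, 1000), "period B": (80, 4000)}

# Which period looks "biggest" depends on the metric chosen.
by_count = max(periods, key=lambda p: periods[p][0])
by_share = max(periods, key=lambda p: share_percent(*periods[p]))
print(by_count, by_share)
```

In this made-up case, period B has more keyword texts in total, yet period A devotes a larger share of its output to the keyword, so each metric would support a different headline.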

Reading response: Textuality

From the readings this week, I saw two major themes emerge:

  1. Distant vs. close reading (perhaps as a parallel for structural vs. individual patterns)
    • It’s ironic to me that Jockers makes the claim that close reading has been rendered “totally inappropriate” and then goes on to (at least as Mandell argues) recreate or embed a sexist structure within his distant reading analysis.
  2. Something like… negative definition? Defining something by its relation to other things (how like or unlike it is)
    • In So & Roland, “white” and “black texts” defined relative to each other
      • I found it rich in meaning that the absence of the word “white” in a text was a strong indicator that a text was by a white author; whiteness defined by its very absence
    • Comp titles being a means of defining what kind of books are expected to find commercial success
      • A parallel in Mandell’s discussion of the conservatism of statistics and Bayesian probability

Questions I have coming out of the reading:

  1. Mandell criticizes Jockers’s analysis as veering from a study of style into mere counting. How does one find the line between such approaches, especially when using quantitative methods?
    • Maybe Mandell argues that there never is a line, it’s always both? (“The biological binary…is a textual and numerical” [production].)
  2. The shift in the So & Roland essay from categorizing texts along a racial binary to a spectrum of probability seemed brilliant, and opened up a lot of avenues of inquiry. And Mandell speaks of “animat[ing] numerical processes.” But it feels hard to know how to design projects this way from the outset. Do you find that after some time in the field, digital humanists can kind of “slice and dice” their research questions this way, or is it really a matter of experimenting and seeing which way the data leads the research?