Computing Ireland’s Place in the Nineteenth-Century Novel: A Macroanalysis

Author: Matthew L. Jockers (University of Nebraska-Lincoln)


Right now, in 2015, we are in the early stages of articulating, or defining, a new type of literary criticism. It, whatever it may be, goes by various names. For describing this kind of work, many have latched on to the recently minted, newly popular, and annoyingly imprecise term “Digital Humanities.” To my mind, this thing that we have come to call “Digital Humanities” is neither a method nor a discipline, and though it traces its roots to matters of a deeply textual nature, it is not a term uniquely applied to work in literary studies. For this and many other reasons, I generally find the term “Digital Humanities” useless; it defines nothing in particular, and it certainly does not define or describe the kind of literary research in which I am engaged. There have been any number of thoughtful essays written on this subject of defining the Digital Humanities, and Matthew Kirschenbaum’s “What is Digital Humanities and What’s It Doing in English Departments?” is one of the most recent and most informative.[2] Some consensus seems to be gathering around the idea that Digital Humanities (or sometimes the Digital Humanities) is an area of research or practice that is situated on the boundaries between the strictly humanistic and the strictly technical, but that formulation is totally unsatisfying. I confess, I am not at all certain that I know what Digital Humanities is, even while I am quite confident that I am a digital humanist.

I mention this as an important corrective. Some years ago an English department colleague asked me, in all sincerity, “What do I need to do to break into the Digital Humanities?” This is the wrong question. I was at a loss for an answer. I answered with a question: “Do you have any literary questions that would warrant a digital approach?” To put that another way, I think it is critical to begin with the disciplinary questions, not with an approach or a method, and certainly not with the idea of joining the bandwagon. One does not “break into” Digital Humanities; one discovers that certain questions demand certain methodologies. The worst examples I have seen of work claiming to be “Digital Humanities” research are those that simply apply computational tools to some subject or object without a clear, and clearly humanistic, objective. In such work, the focus is too often placed upon the “what” and the “how” with too little emphasis on the outcome, on the new knowledge, on answering the “so what” and “who cares” questions. Shortly after I left Stanford, in 2011, the literary lab that I was involved in began a lecture series titled “So What?” The goal of this series was to address this problem head on, to force us to be quite explicit in addressing why this or that application of the digital mattered and why it led to new insight, to new knowledge, and how it advanced our understanding of the object of study.

I came to use computational tools in my own work in order to help me answer questions of scale, to explore literary history writ large. Though I did not know I was doing Digital Humanities at the time, my computational work began in the late 1990s when I was writing a dissertation about the function of the American West in Irish and Irish-American literary imagination. At that point, I was really just using databases to help organize and filter an enhanced bibliography of five or six hundred Irish-American literary works.[3] I could have done this work using index cards and a lot of tactile manipulation, but there were simpler and more efficient ways to organize the information using technology. Once these records were digitized, I discovered that I could analyze them in very new and very precise ways. More recently, I have come to use computational tools to answer far more complex questions, and these questions are almost always about scale.

Some years ago Franco Moretti coined the now familiar term “distant reading” to describe a new way of thinking about literary material. Some critics thought this was a radical idea; it seemed to challenge our familiar and comfortable critical practice of close reading. I have recently argued in favor of the term “macroanalysis” for describing the computational methods that enable distant reading.[4] Elsewhere, though, I have described my type of literary criticism as “quantitative formalism,” “data-driven criticism,” and, some years back, “computational humanities.” Enough is enough. The label does not matter. The work is what matters: the outcomes and what they tell us about, in my case, literary history. No one is ever going to look at a piece of research and conclude, “Wow, that is some great Digital Humanities.” Perhaps someone will say, “Wow, that is some great distant reading” or “some great macroanalysis,” but I hope not. Much more interesting, I think, is something like this: “Wow, Jockers claims that a fundamental reason why the works of Jane Austen remain popular today and those of Maria Edgeworth do not is that Edgeworth failed to write about England.” That’s an idea with some meat on it, and I will offer some preliminary, data-driven, macroanalytic, computationally derived evidence to support it in a later part of this essay.



I have been interested in literary geography and the role and function of place in Irish literature since my graduate school days. In 1997 I completed a dissertation that explored representations of Ireland and of the American West in Irish and Irish-American literary imagination.[5] In 2004, I began geocoding the fictional settings in ~750 works of Irish-American literature, and I later used Google Earth to visualize Irish-American literary productivity as a factor of time and setting.[6]



The movie above is only the most visually appealing part of that research. Each icon seen here is a single book. It appears on the map in time based on its year of publication and then in place based upon either the book’s fictional setting or the author’s place of residence.[7] Through this work I was able to argue that a few of our assumptions about Irish-American literature needed reconsideration. Among them was Charles Fanning’s observation that Irish-American literature goes through a phase of decline in the period from 1900 to 1930.[8] This geographically inflected research suggested that the decline Fanning observed was primarily an eastern U.S. phenomenon and that, while East Coast Irish-Americans were going silent in the early part of the twentieth century, their counterparts in the west were flourishing.

A couple of years ago, a small group of us in the Stanford Literary Lab—namely Ben Allen, Cameron Blevins, Ryan Heuser, and myself—became interested in literary geography. We began work on a research project to map the many places that appear in a corpus of 3,500 nineteenth-century novels. The project we imagined was considerably bigger and considerably more complicated than my experiments with Irish-American fiction because this work could not be done by hand, or, more accurately, it could not be done quickly and easily by hand. As it turns out, this is not a simple problem to solve computationally either. We faced two primary problems.

First was the problem of identification. How could we use computers to accurately identify all of the place-names in novels? The obvious approach of using gazetteers proved problematic. In one gazetteer we consulted, “Providence” was a place and so was “Hope.” These may be places, yes, but they also denote other things. Simply running a search for all places in the gazetteer that also appeared in the fiction would have yielded far too many false positives, too many places that were not really places at all.
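The false-positive problem is easy to reproduce. The following is a minimal sketch, not the project's code: the four-entry `GAZETTEER`, the `gazetteer_hits` function, and the sample sentence are all invented for illustration.

```python
import re

# Hypothetical mini-gazetteer; a real one would hold many thousands of entries.
GAZETTEER = {"Providence", "Hope", "Dublin", "Florence"}


def gazetteer_hits(text, gazetteer=GAZETTEER):
    """Return every capitalized word in the text that matches a gazetteer entry."""
    tokens = re.findall(r"\b[A-Z][a-z]+\b", text)
    return [token for token in tokens if token in gazetteer]


# Invented sentence: only "Dublin" is actually used as a place here.
sample = "Hope alone sustained Florence; she trusted in Providence and sailed from Dublin."
hits = gazetteer_hits(sample)
print(hits)
```

Naive lookup flags four "places" in this sentence, although three of them are an abstraction, a character, and a figure of speech.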

Second was the problem of ambiguity; place names are ambiguous. “Charlotte,” for example, is used only as a first name in our corpus and never as a city. “Florence” is almost always a character but occasionally a city in Italy. “Charlton,” “Denver,” “Albany,” “Hastings,” “Belmont,” “Gresham,” “Wilmington,” and “Woodstock” are all in our corpus, and all more commonly used as last names than as cities. Additionally, place names are often reused in different geographic locations: “Richmond” is both a town in southwest London and a city in Virginia. “Dartmouth” is a city in England, the U.S., Canada, and Australia. “Georgia” is both a U.S. state and a country in the Caucasus.

To tackle the first problem, we scrapped the gazetteer in favor of Named Entity Recognition (NER). NER is a tool, developed by researchers in the field of natural language processing (NLP), that identifies places using a trained statistical model that is sensitive to semantic and syntactic information in a text. NER is not perfect; it sometimes thinks that “Florence” the character is “Florence” the city, so we took the data from the NER process and then employed topic modeling as a means of muting the place-ambiguity problem.

Topic modeling is a complex algorithmic process that begins by assuming that each book in a corpus is composed of a finite set of possible topics.[9] If it helps, you might think about books as if they were plates that authors fill up at a buffet full of topics, and those topics can be places or themes. Imagine if a youthful Maria Edgeworth and a young Herman Melville wandered into a topic modeling buffet, but instead of choosing to fill their plates with peas and fish as they might at a regular buffet, they chose the settings for their novels. Melville might choose a scoop of New England and a smidgen of South Pacific, Edgeworth a dollop of Dublin and a ladle’s worth of Longford. The topic modeling process has a way of identifying and measuring these textual features as they exist on the individual plates and within the buffet as a whole.[10]

Less metaphorically, we hand the machine a corpus of books, and we tell it to find some set number of topics. We ask it to figure out what is on the buffet menu based on what is found on each of the plates individually. How we set this number of topics to discover is a nuance I won’t get into here. But for the sake of example, let us say we set the number of topics at 100. The machine looks at all the books as a whole and at each book individually, and for each book it looks at each word in the context of all the other words in the book and in the corpus. The mathematics is somewhat dizzying, really. After several hours and, as my friend David Mimno likes to say, “the computational magic of one billion iterations,” the machine spits out a set of word clusters that it has identified as being semantically or contextually similar. The machine does not, of course, know anything about the words’ meanings; it just knows that statistically these words seem to be used in similar ways throughout the whole corpus. In this way the topic modeling process realizes, or operationalizes, what the linguist Rupert Firth was thinking of in 1957 when he wrote of how we shall come to know the meaning of a word by the company it keeps.[11]
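For readers who want to see the moving parts, here is a toy sketch of one common way to fit such a model: latent Dirichlet allocation with collapsed Gibbs sampling. This is an assumption made for illustration; the essay does not specify the software or algorithm used. The function name `toy_lda`, the hyperparameters `alpha` and `beta`, and the three tiny "books" are all hypothetical.

```python
import random
from collections import Counter


def toy_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit a toy LDA model to tokenized documents by collapsed Gibbs sampling.

    Returns (proportions, topic_word): per-document topic shares that sum
    to 1, and a Counter of word counts for each topic.
    """
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # topic counts per book
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # tokens per topic
    assignments = []

    # Start by assigning every word token to a random topic.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        assignments.append(zs)
        for word, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][word] += 1
            topic_total[z] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][word] -= 1
                topic_total[z] -= 1
                # Weight each topic by how common it is in this book and
                # how strongly it is associated with this word elsewhere.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][word] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][word] += 1
                topic_total[z] += 1

    # Smoothed share of each topic on each "plate".
    proportions = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in row]
        for row, doc in zip(doc_topic, docs)
    ]
    return proportions, topic_word


# Three invented "books" reduced to place names.
docs = [
    ["dublin", "galway", "cork", "dublin"],
    ["boston", "salem", "boston", "concord"],
    ["dublin", "cork", "boston", "salem"],
]
proportions, topic_word = toy_lda(docs, num_topics=2)
```

Each row of `proportions` corresponds to one plate in the buffet metaphor, and each entry of `topic_word` to one dish on the menu. A production tool would add hyperparameter tuning and run over vastly more text.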

The word “bank” is sometimes used as an example of how topic modeling can aid in word sense disambiguation.[12] Occasionally the word “bank” refers to a financial institution and occasionally to something you might stand on when fishing. The topic model has a way of disambiguating these two meanings by identifying words that appear in the context of the different occurrences of “bank.” So, for example, when the word “bank” co-occurs with “money,” “loan,” “finance,” and “deposit,” we understand that the use of “bank” is with regard to a financial institution. When “bank” appears with “river,” “stream,” and “fish,” it is all but certainly referencing the edge of a flowing body of water.
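The intuition can be sketched in a few lines. Note the simplification: a topic model infers its word clusters from the corpus, whereas this sketch is handed them in advance. The `SENSES` dictionary reuses the context words named above; the sense labels, the `guess_sense` function, and the example sentences are hypothetical.

```python
# Context words taken from the example above; the sense labels are invented.
SENSES = {
    "financial institution": {"money", "loan", "finance", "deposit"},
    "river edge": {"river", "stream", "fish"},
}


def guess_sense(context_words, senses=SENSES):
    """Pick the sense whose cluster shares the most words with the context."""
    words = {word.lower() for word in context_words}
    scores = {label: len(words & cluster) for label, cluster in senses.items()}
    best = max(scores, key=scores.get)
    # With no overlapping words at all, decline to guess.
    return best if scores[best] > 0 else None


print(guess_sense("she took a loan from the bank to deposit money".split()))
print(guess_sense("he sat on the bank of the stream to fish".split()))
```

The same logic carries over to place names: the company a name keeps decides which referent is meant.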

In work I had done for Macroanalysis, I had found I could extract thematic elements from a corpus of novels by applying topic modeling to just the nouns of the books.[13] In order to do that, I began by part-of-speech tagging all of the texts in my corpus using a standard part-of-speech (POS) tagger.[14] Once tagged, I wrote a script to remove all of the words that were not nouns prior to running the topic model. Figures 1 and 2 show two example topics returned in this process.
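The noun-filtering step might look something like the sketch below. It assumes the tagger emits (word, tag) pairs using Penn Treebank tags, in which every noun tag begins with "NN"; the essay does not name the tag set, and the `keep_nouns` function and sample sentence are hypothetical.

```python
def keep_nouns(tagged_tokens):
    """Keep only words whose tag marks a noun (NN, NNS, NNP, NNPS)."""
    return [word for word, tag in tagged_tokens if tag.startswith("NN")]


# Invented sentence, tagged by hand for illustration.
tagged = [
    ("The", "DT"), ("landlord", "NN"), ("evicted", "VBD"),
    ("his", "PRP$"), ("tenants", "NNS"), ("in", "IN"), ("Galway", "NNP"),
]
nouns = keep_nouns(tagged)
print(nouns)
```

Only the surviving nouns would then be passed on to the topic model.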


Figure 1. Wilderness and Frontier



Figure 2. Tenants and Landlords


The topic modeling process only generates the word clusters; it does not generate the labels. I have given these topics the labels “Wilderness and Frontier” and “Tenants and Landlords” based on my subjective analysis of the words in the clusters. Nor does the model produce these word-cloud visualizations. Rather, the model assigns each word a weight, and I have used those weights to generate the clouds seen in these figures. The clouds provide an easy way of assessing or identifying the main theme in the cluster: words that are more central and frequent appear larger in the cloud than the still relevant but less frequently co-occurring words that take the outer positions around the central terms. This method of topic modeling nouns proved effective for identifying the 500 major thematic threads that I discuss in Macroanalysis.[15]

It turns out that a similar approach can be used to effectively identify and disambiguate place names. This method works because when a writer sets a book in a particular place that writer tends to mention other places that are geographically related. An author writing about Ireland, for example, might mention Dublin, Galway, and Belfast. Another author writing about Belfast, Maine, is not likely to also mention Dublin and Galway, and so the machine learns that Belfast, Ireland is different from Belfast, Maine. Figures 3-8 present six “place clusters” taken from the 100 that I harvested in this research. These six geographically oriented clusters demonstrate that topic modeling can be used as an effective way to identify collocated place names and thereby aid in resolving the problem of place-name ambiguity by giving us a general sense of the regions being referenced in the corpus.


Figure 3. France



Figure 4. Africa



Figure 5. America



Figure 6. New England



Figure 7. Ireland



Figure 8. Germany


Over the course of this research, I found that I needed to broaden my definition of “place” to focus less on a specific set of geographic coordinates, or a specific address (such as 7 Eccles, Dublin) and more on a general sense of place—for example, “Dublin and environs.” Rather than developing a method for placing pins on a map, we created a system for extracting the essence of a place or region. In retrospect, of course, this all seems very appropriate in the context of place as a literary construction. Joyce was not, after all, trying to create a book on the geography of Dublin in Ulysses; he wanted to capture the essence of the place. And so, as it happens, the places this method identified did not necessarily have to be places you could find on a map: “heaven” and “hell” as well as “Mars,” “Venus,” and “Earth” found their way into the model. More important than these few outliers, however, the technique provided a tractable means of disambiguating different places with the same names: “Georgia,” for example, could appear inside a topic about the American South, as well as in a topic of countries in Europe (see figures 9 and 10).


Figure 9. Eastern Europe




Figure 10. The American South


Given the way the model works, I can be confident that instances of Georgia labeled by the topic model as occurring in the American South cluster (figure 10) are in fact mentions of the state, and those in the other topic (figure 9) are mentions of the country in Eurasia.

You may have noticed the appearance of several character names in these figures: “Frederic” in figure 8 for example. Occasionally character names like “Frederic” slip into an otherwise toponymic cluster. In general, these names are not frequent enough across the entire corpus to corrupt the general sense of place conveyed in the cluster. Indeed, even an aberration that is fairly common throughout this corpus tends to get muted by this method. Notice the word “Illinois,” for example, in figure 7. “Illinois” shows up in this cluster, just under the “n” in the head word “Ireland.” Rest assured, Illinois is not a town in Ulster! It appears in this cluster because the corpus contains several hundred books that were originally digitized by the University of Illinois. These books contain bits of boilerplate metadata that we had not effectively extracted before running this model and preparing these images.[16] Nevertheless, the error is trivial within the larger context. The preponderance of Ireland-related words overpowers and mutes the single outlier word, “Illinois.” Clearly this cluster is about Ireland and not Illinois.

You may also notice the word “Westminster” to the southwest of “Illinois.” Unlike “Illinois,” “Westminster” is not here by mistake. When authors in this corpus are writing about Ireland they often find occasion to mention places in England or on the continent. You’ll also find “Germany” located to the southeast of “Illinois” in the word cloud. Like “Illinois,” however, these words are not frequent enough in the context of the many other words of clearly Irish origins, so they do not sway us from the conclusion that what is being represented in this cluster is in fact the country we know as Ireland.

Even with these obvious exceptions, I think we can agree that the tool has successfully identified a cluster of words that captures the essence of the place that we know as Ireland. Irish cities and regions—Dublin, Munster, Cork, Tipperary, Ulster, and so on—all surround the headword, “Ireland.” Alongside these are smaller but obviously Irish localities: Galway Bay, Wicklow, and so on. It is an imperfect method for identifying pure distinct places, but it is not at all ineffectual or unproductive in the identification of place as defined more broadly and thought of in terms of literary representations of place rather than pure geography. In this sense it is a very bad method for geographers and a very good method for literary scholars interested in depictions of place on a broad scale.

Not only does the topic model help to give us a sense of the dominant places in the corpus as a whole, the model also provides us with a book-by-book measure of the proportion, or percentage, of each place in each text. As we might expect, the model returned proportional data indicating that the Irish authors in the corpus are far more likely to write about Ireland than their English or American peers. In the novel Castle Rackrent, for example, Ireland is recognized as being the most prominent of the 100 places the model identified. In fact, the Ireland cluster is assigned a score of 68%, meaning, roughly, that when Maria Edgeworth was at the topic-modeling buffet she selected a portion of Ireland that covered 68% of her plate (see figure 11).


Figure 11. Places in Castle Rackrent


This proportion is not at all surprising given that Castle Rackrent is the prototypical regional novel of the nineteenth century and a book that is considered by many to be the first national tale—so, not surprising, but a solid affirmation that the model is working.

These proportional scores can be used as a highly effective search mechanism. If we wish to know which books in the corpus are most interested in representing Ireland, we can sort the data according to these proportions. Doing so for this cluster of words relating to Ireland returns books by well-known Irish authors including Maria Edgeworth, William Carleton, Lady Morgan, and Gerald Griffin. But this kind of data can also be used to plot literary attention to place over time; figure 12 shows attention to Ireland broken down by author nationality. The y-axis is a measure of the yearly mean proportion of the Ireland cluster found in works published in a given year. In 1839, for example, about 14% of the places being represented in Irish-authored books were Ireland.


Figure 12. Attention to Ireland by author nationality
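Both uses of the proportional scores, ranking books and averaging by year and nationality, reduce to a sort and a grouped mean. The sketch below is illustrative only: the records in `books` are placeholders with invented titles and proportions, and the function names `rank_by_place` and `yearly_mean` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Placeholder records; titles and proportions are invented, not real data.
books = [
    {"title": "Book A", "year": 1839, "nationality": "Irish", "ireland": 0.5},
    {"title": "Book B", "year": 1839, "nationality": "Irish", "ireland": 0.25},
    {"title": "Book C", "year": 1839, "nationality": "English", "ireland": 0.125},
    {"title": "Book D", "year": 1840, "nationality": "Irish", "ireland": 0.0},
]


def rank_by_place(records, place="ireland"):
    """Sort books from most to least attention to the given place cluster."""
    return sorted(records, key=lambda record: record[place], reverse=True)


def yearly_mean(records, place="ireland"):
    """Mean proportion of the place cluster per (nationality, year) group."""
    groups = defaultdict(list)
    for record in records:
        groups[(record["nationality"], record["year"])].append(record[place])
    return {key: mean(values) for key, values in groups.items()}


ranked_titles = [record["title"] for record in rank_by_place(books)]
means = yearly_mean(books)
print(ranked_titles)
print(means[("Irish", 1839)])
```

Plotting the values of `means` against year, one line per nationality, would give a chart of the kind shown in figure 12.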


One cannot help but look at the decline in attention to Ireland between 1840 and 1855 and wonder if this is precisely because of the Irish Famine. Perhaps literary attention to Ireland waned in the wake of this national tragedy? I have not pursued this hunch, but I wanted to take the opportunity to suggest how these graphs might be historically contextualized. Such contextualization can make for enticing points of entry into the data and, conversely, such data can be a driver of criticism and interpretation.

Before going deeper into a discussion of the data for Ireland, I will mention that I have produced similar data for ninety-nine other places. Scholars with an interest in literary representations of Spain (figure 13), of Nordic Europe (figure 14), or Switzerland (figure 15), to name just three, might benefit from this approach.


Figure 13. Spain



Figure 14. Nordic Europe



Figure 15. Switzerland


Not every place cluster in the model, however, represents a large region like those above; some can be very specific and nuanced. My collaborators on this project, for example, have been especially interested in exploring different representations of London in this corpus. Figures 16 and 17 present two nuanced visions of London: “London” (figure 16) appearing in the context of “Portland Place,” “Hyde Park,” “Soho,” and “Temple Bar;” and “London” (figure 17) as seen in the larger context of “Manchester,” an industrial town outside of it.


Figure 16. London and its wealthy environs



Figure 17. Manchester and London


To get the full sense of how Ireland is being portrayed within the texts in this corpus would seem to require closer inspection, and at the scale of a few books we could conduct such inspection by close reading. We could, as I said a moment ago, use the data harvested from the model as a search mechanism that would allow us to find those books in the corpus that are most interested in representing Ireland. Once identified, we could go read those books. As a method for gaining a fuller sense of how a place is represented in a handful of titles, close reading of this sort makes sense. As a way of thinking about how Ireland is represented across the entire corpus, however, close reading is just not plausible. 

We can get part of the way toward our larger objective by leveraging what we know about the authors and the books in the corpus, that is, by leveraging the metadata. Using book-level metadata, for example, I can investigate which authors are prone toward representing which places, and I can explore the extent to which specific places are frequented by specific types of authors. In this corpus, for example, male authors are far more likely to write about Mexico than are female authors (see figure 18), whereas women show a slightly stronger preference for writing novels that feature Paris (see figure 19). 


Figure 18. Attention to Mexico graphed by time and author gender



Figure 19. Attention to Paris graphed by time and author gender


Unsurprisingly, British authors are far more likely to write about London (and other major European cities) than are American authors. There is, however, one exception to this observation: Americans write more about Madrid and Spain than do the British authors. When the Americans write about Spain, however, they tend to do so in the context of places that are closer to home, namely Cuba, Granada, Puerto Rico, and the Caribbean.

Analysis of this type of external metadata (information about gender, nationality, time period, and so on) can be extremely productive, and while such investigations may teach us something about the geographical preferences or tendencies of male and female authors, or about the preferences of authors from America versus England, in the end they tell us very little about the manner in which these authors actually represent the places. In other words, we may know now that the male authors in this collection are far more likely than the female authors to write about Mexico, but we do not yet know in what manner or how they write about and represent that place. What do we do, then, if we really wish to study the representation of place on a broad scale, to ask not “How is Ireland depicted by the Irish author Maria Edgeworth?” but “How is Ireland depicted in the nineteenth-century Irish novel?” Or, to use another example, when authors are writing about slavery in the American South, what are the dominant attitudes and sentiments that get expressed? And, even more important for literary history, how, if at all, do these representations change over time, across author genders, and across author nationalities (see figure 20)?


Figure 20. Linking place with theme and sentiment


With all this as a background, let me shift now in the direction of linking what I have described about representations of place with information from prior research on theme and with some new data about sentiment or emotion. In figure 20, I introduce a third element of analysis. So far I have discussed places and themes, and now I introduce emotional valence or what is generally referred to as sentiment in the language of Computational Linguistics and Natural Language Processing. Using techniques similar to those described above, I generated a set of 25 sentiments, or clusters of words representing expressions of different emotions. For this work, I drew on research in the fields of sentiment analysis and opinion mining in order to develop a method of scoring the clusters in terms of their emotional valence: in my system, each cluster is scored on a scale from -1 to +1, where -1 is a very negative perspective (in the figure below, these are colored in red) and +1 is a very positive sentiment (colored green; yellow clusters are neutral). The details of this scoring method are beyond the scope of this essay, but figure 21 shows five more sentiment clusters from the total of 25 in order to give you a sense of the data. 


Figure 21. Five sentiment clusters


In figure 20 you saw a negative sentiment characterized by the large headwords of “guilt” and “cruelty.” In figure 21 you can see five more clusters conveying or expressing that which is beautiful (upper left), that which is amiable (upper right), that which is wretched (lower left), that which is dreadful (lower right), and, finally, that which is temperate (middle). Just like the thematic and geographic clusters seen previously, the presence of these sentiments is measured across the corpus and across every book in the corpus. So, for example, the data indicate that the book in my collection with the most sustained expressions of guilt and cruelty is an epistolary novel, appropriately titled The Victims of Society, by the Irish novelist, Marguerite Gardiner. Figure 22 provides a linear rendering of the amount of this guilt and cruelty sentiment topic that finds expression from the beginning to the end of the novel. 


Figure 22. Guilt and cruelty in Victims of Society


Figure 23, on the other hand, offers a measure of overall sentiment, either positive or negative, across the entire book.


Figure 23. Positive and negative valence in Victims of Society (dark line is a moving average)
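A moving average of the kind drawn as the dark line can be sketched as follows. The window size and the centered, edge-shrinking behavior are assumptions; the essay does not say how the line in figure 23 was computed. The function name `moving_average` and the sample scores are hypothetical.

```python
def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the two ends.

    Assumes an odd window size so the window is symmetric around each point.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Invented segment-by-segment valence scores for a short text.
valence = [0, 3, 0, 3, 0]
print(moving_average(valence, window=3))
```

Smoothing of this sort trades local detail for a readable overall contour, which is what makes the dips in a long novel visible at a glance.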


In figure 24 the two measures are superimposed, so that you can easily see how the emotional valence dips into the lowest regions at exactly the moment in the novel when expressions of guilt and cruelty are at the most concentrated.


Figure 24. Guilt and cruelty with emotional valence


By simultaneously analyzing the data about sentiment alongside the data for place and the data for theme, I can begin to explore and identify areas of the corpus and specific books in the corpus where the appearance of certain places is correlated (statistically) with the appearance of particular themes and/or particular attitudes. Unsurprisingly, the theme most closely associated with Ireland is the theme that I have already mentioned and labeled as “Tenants and Landlords” (see figure 2). I say unsurprisingly because the history of Ireland in the nineteenth century is largely a history of tensions between the landed and tenant classes. What was interesting about these correlations, however, was the way they varied, sometimes dramatically, depending on author nationality. So, for example, in novels by Irish authors, the tenants and landlords theme was closely correlated with representations of Ireland; or, put another way, when Irish authors write about tenants and landlords, they also tend to be writing about Ireland. When only English authors are observed, however, the correlation drops to a level indicating no correlation whatsoever. What this means then is that it is the Irish authors and not the English authors in this corpus who are fixating on representing the tenant and landlord situation within a specifically Irish context. When writing about Ireland, the English authors in this corpus tend to avoid mentioning the tenant and landlord situation. This does not mean that English authors did not discuss tenants and landlords, only that they did not tend to do so in the context of discussions of Ireland. So if the English were not writing about Ireland in terms of landholding relationships, what were they most likely to be referencing when they were writing of Ireland?
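A per-nationality comparison of this kind can be sketched with a plain Pearson correlation. That choice of statistic is an assumption, since the essay does not name the measure used. The records below are invented to mimic the pattern described, and the names `pearson` and `correlation_by_group` are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def correlation_by_group(records, a, b, group="nationality"):
    """Correlate two per-book proportions separately within each group."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[group]].append((record[a], record[b]))
    return {
        label: pearson([x for x, _ in pairs], [y for _, y in pairs])
        for label, pairs in grouped.items()
    }


# Invented proportions: (Ireland cluster, tenants-and-landlords theme).
irish = [(0.1, 0.2), (0.2, 0.4), (0.3, 0.6), (0.4, 0.8)]
english = [(0.1, 0.3), (0.2, 0.1), (0.3, 0.1), (0.4, 0.3)]
records = (
    [{"nationality": "Irish", "ireland": x, "tenants": y} for x, y in irish]
    + [{"nationality": "English", "ireland": x, "tenants": y} for x, y in english]
)
correlations = correlation_by_group(records, "ireland", "tenants")
```

Sorting such coefficients across all themes and sentiments, from most positive to most negative, would also surface the topics an author group tends to avoid.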

With data such as this, we may begin to piece together a macroscale picture of how Ireland was represented in nineteenth-century literature, at least in this corpus of roughly 3,500 books. When Irish authors are writing about their home, they often mention the tenant and landlord business, but they also frequently discuss friends, family, and neighbors. When they do the latter, dominant sentiments tend to be positive. On the other hand, when the English write about Ireland, the most highly correlated themes include depictions of peasant dwellings and victories in war, and, as you might expect, the sentiments are of superiority and wretchedness respectively. When I presented this material to some Irish Studies colleagues in Milwaukee, Tim McMahon pointed out how these themes and sentiments are representing the English version of the tenant and landlord theme.

In addition to helping us identify and track over time and across authors the positive correlations between theme, place, and sentiment, this method also allows us to identify factors that are negatively correlated. So I may ask, “When writing of Ireland, what themes are English authors avoiding?” Topping the list as the most negatively associated variable (out of 525) is a cluster of sentiment words in which “true,” “real,” “free,” and “natural” are dominant terms. The second most negatively associated theme is one featuring words that allow for expression of mistakes and fault (see figure 25). I will leave it to readers to make the leap from these observations to interpretation. I will simply note that in this corpus, these are the two topics that English authors are most likely to avoid in the context of works that are also representing Ireland.


Figure 25. Mistakes and fault



It is tempting to move from these preliminary observations to deeper interpretations about the specific books in the corpus or the specific writers, and this movement between scales is precisely what I advocate in Macroanalysis. Instead of rehearsing those arguments, I will conclude with a few observations regarding this type of experimental, quantitative literary study and then offer a few thoughts on Jane Austen and Maria Edgeworth.

First, a warning: we need to be ever mindful that the macroanalysis I describe only reveals larger, general tendencies. It is not always the case that English authors represent the Irish as conquered peasants. Nor is it the case that the Irish always represent their own country as a place of friendly neighborhoods. The most we can say is that in this corpus of nineteenth-century novels these authors do so more often than less often. We must use just as much caution when moving from distant readings to specific conclusions as we do when moving from close readings to general theories of literary history. And here, I think, we may learn something useful from colleagues in Economics.

Macroeconomics may tell us, for example, that the unemployment rate decreased in the previous quarter. We may know anecdotally that in that same quarter our next-door neighbor lost his job. The news from the macroscale is, therefore, only generally good. It represents a trend in the larger economy that tells us something about the health of the overall economy, but it also provides us with a way of contextualizing our own individual experiences within that economy. The fact that the unemployment rate is decreasing across the nation at the same time that our neighbor loses his job provides important context. We might react to this news positively: if you are going to lose your job, it may be better to do so at a time when the job market is expanding. Alternatively, this news might make us wonder more deeply about the circumstances under which our friend lost his job. And, of course, this warning can be formulated in more positive terms as well. We can imagine a scenario in which our neighbor gets a job at precisely a time when the larger job market is contracting, and this would naturally lead us to think very highly of our neighbor’s skills; he was, after all, able to secure a job at a time when the market was particularly tough. A similar case must be made for studies of literary history. We would not expect an economist to generate a sound theory about unemployment based on the behavior of workers in her neighborhood, and I do not think we can generate sound theories of literary history by studying only a few books. We need to read books in context, and the approach I advocate offers a method for such contextualization.

In chapter nine of Macroanalysis, which is in many ways the most speculative and risky chapter in the book, I write about my use of thematic and stylistic similarity as a proxy for literary influence. Among other things, I observe that Jane Austen and Sir Walter Scott possessed elements in their fiction that seem to get repeated into the future. They are like two pebbles dropping into a pond of prose; the ripples from those pebbles reverberate through time. These two authors appear to be influential in terms of the direction that literature takes. Or, thought of another way, they appear to have found early on what becomes popular later on. The same, however, cannot be said of the Irish novelist Maria Edgeworth; but why not?

As a student of Irish literature, I have been reminded more than once that Maria Edgeworth was a writer greatly admired by, and influential to, both Sir Walter Scott and Jane Austen. In the preface to Waverley, for example, Scott writes of how he hoped to attempt “for [his] own country, [something] of the same kind with that which Miss Edgeworth so fortunately achieved for Ireland.”[17] In Northanger Abbey, Austen references Edgeworth’s novel Belinda as a “work in which the greatest powers of the mind are displayed, in which the most thorough knowledge of human nature, the happiest delineation of its varieties, the liveliest effusions of wit and humour, are conveyed to the world in the best-chosen language.”[18] Austen so admired the Irish author that she presented her with a special edition of Emma.[19]

In the early nineteenth century, Edgeworth was the leading female novelist of her day. Not only was she well respected in literary circles (she was friendly with Scott, Wordsworth, Byron, Bentham, and others), but she was also well paid for her work, extremely well paid. In 1814, the year that Austen published Mansfield Park, Edgeworth earned £2000 for her novel Patronage.[20] To put this into perspective, consider that Jane Austen netted a total of £684 13s for the four novels she published during her lifetime.[21] These facts raise the question: why, then, is Hollywood still making movies based on Austen’s novels and not Edgeworth’s? The whispered answer may be “Because Austen was a better writer.” That may of course be true, and we might be able to argue that point convincingly, though our evidence would surely be anecdotal and highly subjective. But what might a macroscale contextualization reveal?

To see if I could discover anything special about Austen and Edgeworth through large-scale contextualization, I began by looking for the dominant themes, settings, and sentiments that find expression in their novels. It turns out that they have a lot in common, but not everything. Specifically, I isolated the two most dominant themes, the two most dominant settings, and the two most dominant sentiments within each writer’s corpus, and I aggregated these into a single measure for each author: call them the “Austen profile” and the “Edgeworth profile.” I then charted the survivability of each profile over time, across the entire corpus. Remember that these measures are derived from the book-to-book topic model proportions; essentially, I am looking at each novel, by each author, as if it were a buffet plate, so there is an Emma plate, a Persuasion plate, a Castle Rackrent plate, an Absentee plate, and so on. Once I have the calculated proportions of each topic in each book, I can calculate the average proportion across all the books for each author. The average values can then be sorted in decreasing order so as to identify the features (themes, settings, and sentiments) with the highest overall use within each author’s oeuvre.
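For readers who think more easily in code, the averaging, sorting, and aggregating steps might be sketched as follows. This is a minimal illustration under stated assumptions rather than the code behind the study: the book labels, feature names, and proportions are all invented, and summing a book's proportions over the six profile features is only one possible way of aggregating them into a single measure.

```python
def mean_proportions(books):
    """Average each feature's proportion across one author's books."""
    totals = {}
    for proportions in books.values():
        for feature, value in proportions.items():
            totals[feature] = totals.get(feature, 0.0) + value
    return {feature: total / len(books) for feature, total in totals.items()}

def dominant_features(averages, feature_kinds, per_kind=2):
    """Pick the top `per_kind` features of each kind by average proportion."""
    chosen = []
    for kind in ("theme", "setting", "sentiment"):
        of_kind = [f for f in averages if feature_kinds[f] == kind]
        of_kind.sort(key=lambda f: averages[f], reverse=True)
        chosen.extend(of_kind[:per_kind])
    return chosen

def profile_score(book, profile):
    """Assumed aggregation: sum a book's proportions over the profile features."""
    return sum(book.get(feature, 0.0) for feature in profile)

# Invented feature inventory: three themes, three settings, three sentiments.
feature_kinds = {
    "courtship": "theme", "estates": "theme", "seafaring": "theme",
    "england": "setting", "ireland": "setting", "france": "setting",
    "affection": "sentiment", "pride": "sentiment", "fear": "sentiment",
}

# Invented "buffet plates" for two hypothetical novels by one author.
author_books = {
    "novel_a": {"courtship": 0.02, "estates": 0.10, "seafaring": 0.01,
                "england": 0.02, "ireland": 0.12, "france": 0.00,
                "affection": 0.03, "pride": 0.03, "fear": 0.02},
    "novel_b": {"courtship": 0.06, "estates": 0.08, "seafaring": 0.00,
                "england": 0.06, "ireland": 0.10, "france": 0.01,
                "affection": 0.05, "pride": 0.02, "fear": 0.01},
}

averages = mean_proportions(author_books)
profile = dominant_features(averages, feature_kinds)

# Score a hypothetical later novel against the author's profile.
later_novel = {"courtship": 0.05, "england": 0.15, "ireland": 0.01}
score = profile_score(later_novel, profile)
print(profile, round(score, 2))
```

Scoring every book in the corpus this way yields one number per book, and it is that number that can be plotted against publication date.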

With the dominant features from each author identified, I could then plot how the Austen profile and the Edgeworth profile get echoed across that pond of prose. Figure 26 shows a chronological plotting of the Edgeworth profile, and what it shows is a fairly obvious downward trend. Edgeworth makes an initial splash, but then the ripples die out fairly steadily. The tendency of future writers to employ the same features that are dominant in Edgeworth declines. Figure 27 shows the Austen profile.



Figure 26. Edgeworth profile



Figure 27. Austen profile
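A trend of the kind plotted in these figures can also be summarized numerically. The sketch below fits an ordinary least-squares line to profile scores ordered by publication year and reports its slope; a negative slope corresponds to ripples that die out. The years and scores are invented for illustration, and the slope is offered only as one simple summary, not as the statistic underlying the figures.

```python
def least_squares_slope(points):
    """Slope of the ordinary least-squares line through (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in points)
    denominator = sum((x - mean_x) ** 2 for x, _ in points)
    if denominator == 0:
        raise ValueError("need at least two distinct x values")
    return numerator / denominator

# Invented (publication year, mean profile score) pairs.
scores_by_year = [
    (1800, 0.30), (1820, 0.26), (1840, 0.21),
    (1860, 0.17), (1880, 0.12), (1900, 0.09),
]

slope = least_squares_slope(scores_by_year)
print(f"change in profile score per year: {slope:.5f}")
```

Running the same calculation on two authors' profiles gives a rough way to compare how quickly each one fades or persists across the century.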


Literary history was not kind to Edgeworth, and when I filtered the data informing these graphs, I discovered that the fundamental difference between the two profiles was geographic. The main difference between the two authors was their attention to place. Both Edgeworth and Austen employ similar sentiments and they write along similar themes. What is different about the two writers, at least in terms of the metrics employed here, is their use of geographic setting: Edgeworth’s focus is Ireland and Austen’s is England. In the end, one cannot help but wonder if Edgeworth’s legacy might have been different had she chosen to write less about Ireland and more about England.

[1] This essay is an edited version of a keynote lecture given in October 2013 at the Midwest regional meeting of the American Conference for Irish Studies (ACIS). I think it is fair to say that in ACIS circles I have a reputation for being the “techie guy.” Years ago, while serving as the organization’s webmaster and, then later, as Secretary, I spearheaded a series of technologically driven innovations including online voting and online membership renewals. When I was invited to give this lecture, the organizing committee hoped that I would explore the intersections of Irish Studies and Digital Humanities. This essay is, therefore, about intersections; it is not about grand conclusions. The lecture was written to offer some background on Digital Humanities and introduce a data-driven way of thinking about literary history and literary criticism.

[2] Matthew G. Kirschenbaum, “What is Digital Humanities and What’s It Doing in English Departments?” ADE Bulletin 150 (2010): 55-61,

[3] The bibliography was enhanced, so to speak, by the addition of various types of metadata about the works, metadata—things such as author gender, fictional setting, and other information pertaining to the texts under analysis—that I had collected as part of my research.

[4] See Matthew L. Jockers, Macroanalysis: Digital Methods and Literary History (Urbana-Champaign: University of Illinois Press, 2013).

[5] Matthew L. Jockers, In Search of Tir-Na-Nog: Irish and Irish-American Literature in the West (PhD diss., Southern Illinois University, 1997).

[6] This work was presented at the 2008 MLA convention and featured in Jennifer Howard’s story “Literary Geospaces” in the Chronicle of Higher Education. See Howard, “Literary Geospaces,” Chronicle of Higher Education, August 1, 2008,

[7] The business of selecting a geographic locale for each book is sometimes complicated. Kathleen Norris, for example, was born and raised in San Francisco and set many of her novels in California. After marrying Charles Norris, she moved to New York, and, to complicate matters more, some of her works are set on the East Coast. I nevertheless code her as a western, California writer given the deep affinity she had for her birthplace, as is evidenced in her autobiography, Family Gathering, and in her separate essays, “My California” and “My San Francisco.”

[8] Charles Fanning describes this period in Irish-American literary history as a “lost generation.” See Charles Fanning, The Irish Voice in America: 250 Years of Irish-American Fiction, 2nd ed. (Lexington: University Press of Kentucky, 1999).

[9] The technical term for the type of topic modeling applied in this work is “latent Dirichlet allocation,” or simply LDA. See David M. Blei, Andrew Y. Ng, and Michael I. Jordan, “Latent Dirichlet Allocation,” in Journal of Machine Learning Research 3 (January 2003): 993–1022.

[10] I have written a layman’s introduction to topic modeling on my blog. See “The LDA Buffet is Now Open; or, Latent Dirichlet Allocation for English Majors” at

[11] J. R. Firth, “A Synopsis of Linguistic Theory, 1930–1955,” in Studies in Linguistic Analysis (Oxford: Blackwell, 1957), 179.

[12] This example is borrowed from Mark Steyvers and Tom Griffiths, “Probabilistic Topic Models,” in Handbook of Latent Semantic Analysis, eds. Thomas K. Landauer, Danielle S. McNamara, Simon Dennis, and Walter Kintsch (New York: Routledge, 2011), 427-448.

[13] In addition to chapter eight of Macroanalysis, also see where I have described my process and cataloged the results.

[14] A POS tagger uses a trained statistical model to identify and “tag” all the words in a document with their part of speech. These models are not perfect and accuracy varies depending upon the genre of text being tagged. I used the Stanford log-linear part-of-speech tagger: see Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer, “Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network,” in Proceedings of HLT-NAACL 2003 (Stroudsburg, PA: Association for Computational Linguistics, 2003), 252-259,

[15] See also Matthew L. Jockers, “500 Themes from a Corpus of 19th-Century Fiction,”

[16] I have subsequently improved the pre-processing routines to correct this problem, but it makes for a useful lesson here.

[17] Sir Walter Scott, “General Preface (1829),” Waverley, or ’Tis Sixty Years Since (Oxford: Oxford UP, 1986), 352-53.

[18] Jane Austen, Northanger Abbey (Boston: Little, Brown and Company, 1903), 36.

[19] According to the website, “the first edition copy of Emma in 3 volumes, that was presented to Maria Edgeworth in Ireland by Jane Austen through the offices of her publisher, John Murray” was sold at auction by Sotheby’s in December 2010. See “Maria Edgeworth’s ‘Emma’ and Edward Knight’s Wedgewood China for sale…,”

[20] John Mullan, “The Truth About Women,” The Guardian (February 7, 2010),

[21] This figure is reported in James Heldman’s “How Wealthy is Mr. Darcy—Really? Pounds and Dollars in the World of Pride and Prejudice,” Persuasions 12 (1990), 39-48. See: