Asterisk Scholarship: A Dystopic View on Digital Research

I. A Potentially Meaningless Debate

As is well known, the communities involved in scholarly research and publishing are currently embroiled in a conversation (to put it gently; more like open polemic) about 1) open-access publishing, 2) the increasingly dominant role of para-academic companies like JSTOR and Project Muse, and 3) whether the rights to republish scholarship in digital form rest in the hands of the scholars or the publishers. I am not interested in commenting here on these hot-button issues, on which I am, in any case, not qualified to say anything that will matter on a grand scale. It’s obvious that money (not just who gets it, but whether scholarly publishing is a money-making enterprise at all) is the primary issue: for a long time, academic publishers have charged high prices for their books and journals, and the new digital versions of these traditional projects, or brand new “born digital” projects, cost just as much as, and usually more than, the traditional print versions. There is nothing unexpected here. Small colleges that are too budget-constrained to pay such prices are in the same position they’ve always been: they have to make hard choices about what to buy for their libraries, and the high costs of academic books (and of new digital tools) are directly detrimental to the students and scholars at these small (or large but less wealthy) colleges. As I said, nothing new here.

However, the recent change, it seems to me, is that the ease of digital distribution is a concept with such momentum (and global implications) that the public release of all of the extant scholarly information (both humanities and science included) in digital form is absolutely inevitable. Google has done a lot of the hard work (though not always well; some of Google’s digitizing will need to be redone). Now it’s just a matter of time until people like Aaron Swartz collect all the documents on JSTOR, Project Muse, and the rest of the scholarly repositories and just release them online. And if Google Books gets hacked as well, then all the most important genies will be out of the bottle. This is assuming a major academic player in the Google Books digitization project (one of the universities involved) doesn’t just decide to open the gates and dare the publishers or the Authors Guild to sue them. We’ll get there eventually. Needless to say, from this perspective new endeavors like the Digital Public Library of America are only stopgap measures (accepting that the genie is out of the bottle while simultaneously trying to put it back in) and may not even get off the ground before the switch is thrown.


But that is not what I’m interested in discussing here. This debate is only the background against which students and scholars continue daily to do their work on campuses and in libraries throughout the world. For these students and scholars the debate might as well not be going on. They will get the books any way they can get them. This is a key aspect of the scholarly ecosystem that I feel is lacking from the public perception of the debate, and it is an aspect that the big-name academic commentators seem to be utterly unaware of. PDFs are available everywhere: it takes very few keystrokes to get to sites that have photo-ready PDFs of recent scholarly books from Oxford University Press, Harvard University Press, Wiley, etc. Basically, you name it and you can get it, assuming it’s a relatively recent book (or a high-demand older book). Obviously, these presses have people on their staffs (or in the offices of sub-contractors) who leak the PDFs of recent books: these are print-quality PDFs, in full color, of a level of quality which couldn’t be obtained by scanning the books after publication.[1] If you don’t believe that such books are so easily obtainable, just visit these websites (used here as examples for the sake of argument):


Want a PDF of Anthony Grafton’s new Worlds Made by Words (2011)? Very easy to find on these sites. Want something classic like A.H.M. Jones’s Later Roman Empire (1964)? Also easy. How about the first 50 volumes of the journal Dumbarton Oaks Papers? No problem. Of course, you have to know slight tricks, like how to navigate a sharing site, or how to extract a .rar archive, but the learning curve is very gentle (especially compared to the hoops one might jump through to download a recent film through .torrent files). All of these volumes just mentioned are very much still under copyright — and for good reason since they’re in high demand — but obtaining them couldn’t be easier.

Alternatively, if it’s a less popular book that’s still in copyright (and thus not on the sites just listed or on Google Books), then there are devotees of these subjects in scholarly circles who trade such books constantly. I remember being handed, several years ago, a hard drive full of about 200GB of PDFs of scholarly editions, translations, and studies (monographs and articles) that had been meticulously scanned by a group of graduate students at an Ivy League university. All the students in the group were working on (and were passionately devoted to) a specific area of the humanities. These books and articles were mainly the following:

  1. recent books that were too expensive for graduate students;
  2. important older books that were now out of print but not out of copyright (also usually too expensive, even when available used);
  3. entire series of reference books or primary source collections, often quite rare as well;
  4. reams and reams of articles published in out-of-the-way journals that only a few libraries in the United States carry.

I have done a tremendous amount of academic work (both teaching and research) using that cache of documents. In fact, I know scholars at small colleges who obtained the same cache of documents as I did and have thus far built their scholarly careers on those very same PDFs. It’s like having a massive, specialized research library in your pocket. The scholarly revolution based around this and similar caches of PDFs really is incredible. As I said, students and scholars will obtain the materials necessary for their research however they can. It’s always been the same. We are intellectual mercenaries; copyright means nothing to us.

You may say in response, why not just use the library? If you’re at an Ivy League school, presumably the library itself (Widener, Sterling, Firestone, etc.) is an important perk of having a place at one of these institutions (for students as much as faculty). And obviously all of these specific books were already available since these students were scanning them on site.[2] So, why did this devoted band of graduate students at an Ivy League school go to so much trouble to scan millions of pages of pricey and esoteric books when the books are right there on the shelf? The answer to this is what I’m really interested in discussing in this post. This is where the “dystopia” of my title starts to come in.

III. The Library

Generally speaking, the work of scholarship is now done in front of a computer. This has of course been true for some time in other (more practical or financially lucrative) sectors of “knowledge work”, such as in the legal sector, or among financial researchers, and especially among physicians. But for the humanities, this is a relatively recent phenomenon. In fact, I can clearly remember that at the beginning of my graduate career, say during my M.Phil. degree at Oxford from 1999–2001, I spent the majority of my time dealing with published scholarship in the library, usually by either reading the book or article in situ (in the Bodleian), and taking notes separately (usually on my laptop), or photocopying it (in the Ashmolean, now Sackler, open-stacks library) in order to retain a copy for myself that I could mark up in pen and keep for later. This was how it had been done for decades, since the time (I assume) of the mimeograph copiers in the 1950s and 60s.[3] This was still the age when laptops were primarily used for word processing and email; and I had to dial into the Oxford server via a modem (though my undergraduate institution, Vanderbilt, had ethernet in the dorms by this time). By the last academic year of my D.Phil. at Oxford (2004–5) the concept of the massive online storage of documents (particularly your personal email) was only just coming into the public consciousness: like others, I remember asking everyone I knew if I could poach a Gmail invite: 2GB of online storage seemed like a brave new world which I wanted to be a part of. We didn’t call it The Cloud then, of course, but we recognized how revolutionary it was.

Around this same time (c.2003–2004), I got to know a colleague at a different graduate school who was doing something that completely floored me: he was digitally photographing every book he used, or thought he might ever use, in the university library he was working in (also Ivy League, incidentally). I remember thinking initially that this was a ridiculous waste of time; he even had his wife doing it for him while he was busy writing (or while photographing something else right next to her). In parallel to that photography project, he had constructed an intricate Microsoft Access database in which he collected all the potential bibliography for his doctoral thesis and for years beyond. To each of the bibliographical entries (articles, books, primary texts, etc.) he linked a local folder on his machine of the JPEG images he had photographed of that specific book (PDFs were not yet standard in this context). He numbered the photographs according to the pages of the book so he could attach specific notes typed into his database to specific pages of the book he had photographed. In other words, there was a failsafe system; if later he couldn’t understand the notes he had taken, he could refer to the image directly.
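For readers who think better in tables than in prose, the arrangement just described can be sketched roughly as follows. This is only an illustration under my own assumptions, written in Python with SQLite rather than Microsoft Access; the table names, column names, folder path, and file-naming scheme are all hypothetical and are not my colleague's actual design.

```python
import sqlite3


def build_library(conn):
    """Create three hypothetical tables: works, page images, and notes."""
    conn.executescript("""
        CREATE TABLE work (
            id           INTEGER PRIMARY KEY,
            citation     TEXT NOT NULL,      -- the bibliographical entry
            image_folder TEXT NOT NULL       -- local folder holding the JPEGs
        );
        CREATE TABLE page_image (
            work_id  INTEGER NOT NULL REFERENCES work(id),
            page     INTEGER NOT NULL,       -- page number in the printed book
            filename TEXT NOT NULL,          -- image named after that page
            PRIMARY KEY (work_id, page)
        );
        CREATE TABLE note (
            id      INTEGER PRIMARY KEY,
            work_id INTEGER NOT NULL REFERENCES work(id),
            page    INTEGER NOT NULL,        -- the page the note refers to
            body    TEXT NOT NULL
        );
    """)


def add_work(conn, citation, image_folder, pages):
    """Register a work and one image record per photographed page."""
    cur = conn.execute(
        "INSERT INTO work (citation, image_folder) VALUES (?, ?)",
        (citation, image_folder),
    )
    work_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO page_image (work_id, page, filename) VALUES (?, ?, ?)",
        [(work_id, p, f"{p:04d}.jpg") for p in pages],
    )
    return work_id


def add_note(conn, work_id, page, body):
    """Attach a typed note to a specific page of a work."""
    conn.execute(
        "INSERT INTO note (work_id, page, body) VALUES (?, ?, ?)",
        (work_id, page, body),
    )


def notes_with_images(conn, work_id):
    """Return (page, note, image path) so a note can be checked against its page."""
    rows = conn.execute(
        """
        SELECT n.page, n.body, w.image_folder || '/' || p.filename
        FROM note n
        JOIN page_image p ON p.work_id = n.work_id AND p.page = n.page
        JOIN work w ON w.id = n.work_id
        WHERE n.work_id = ?
        ORDER BY n.id
        """,
        (work_id,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_library(conn)
    # Hypothetical folder name; the pages are just a small range for illustration.
    wid = add_work(conn, "A.H.M. Jones, Later Roman Empire (1964)",
                   "images/jones_lre", range(1, 4))
    add_note(conn, wid, 2, "placeholder note")
    print(notes_with_images(conn, wid))
```

The point of the join in `notes_with_images` is the failsafe: every note travels with the path to the photograph of the page it was taken from.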

More importantly, however, my colleague had created a library of a library, a simulacrum that had all of the benefits of the physical library with none of the disadvantages (it was portable, for one), and added the technological advantages of the computer generally, and of the relational database specifically. It wasn’t necessary that he had already taken notes on the text: all that mattered to make a given book or article worthy to be photographed was that he might possibly need to refer to it in the future.

This approach to collecting scholarship — completely radical to my eyes, at the time — could only be really useful under certain circumstances:

  1. it had to be done at a world-class library with lots of hard-to-find items;
  2. the library had to have open stacks, so you could actually get at the books quickly, instead of calling them up to a reading room;
  3. there had to be a liberal photography policy;[4]
  4. and, by nature it relied on the technology of the digital camera, which was by then a readily available, pocket-sized item.

Fast-forward to the aforementioned band of students scanning in a different Ivy League library a few years later (c.2006–2010), and some of the principles remained the same for them, though there were already important differences too. Instead of photographing the books,[5] they scanned the books using expensive overhead scanners[6] or using scanner-photocopiers[7]. Their motivations were something like the following (as they told me when I queried them in retrospect for the purpose of this post):

  1. that possessing a PDF of the original document gives a kind of intangible security and, practically speaking, means you don’t have to keep going back to the shelf;
  2. more specifically, that one may not always be at a world-class library: it’s like “taking tupperware to an all-you-can-eat buffet”;
  3. and finally, just the egalitarian ideal of “flattening the world” by scanning everything possible; sooner or later, we’re all “have nots”, so it’s better to take from the “haves” while you can (and share it with others).

I would suggest also that, while these ideals were no doubt true for these students, the mere fact that one could do it, technically speaking, was almost a goad unto itself. As I said above, this was really the first time that such massive scanning projects became possible for individuals to do on their own. The university had already paid for the scanners and, generally speaking, these students were some of the only ones using them at that time.

Meanwhile, my colleague who photographs his books had advanced the state of his own art: from 2005 or so (and up to the present) he has been OCRing all of the images he had taken, at least if they were of a standard level of quality (which he now knew to a precise degree). Using the consumer-standard ABBYY FineReader for the OCR and various other pieces of corporate-level software to flatten the page images, he produced beautiful, searchable, black-and-white PDFs of thousands of books and articles, often including tables of contents and cover images within each PDF. While this was a lot of effort, the workflow was smooth, and the end product was certainly desirable: tens (even hundreds) of GBs, thousands of books and articles, and (ultimately) millions of pages of pristine scholarly works, all digitized, all searchable, and all organized in his relational database. The scanning graduate students had produced an equal amount of material, but none of it was OCRed, none of it was flattened, despeckled, straightened, indexed, filed in a bibliography, etc. As I said, there was a pristine quality to the photographed+OCRed PDFs that gave me a sense of what was possible, and even perhaps addicted me to a level of interaction with the scholarship that I had not previously experienced.

IV. Asterisk Scholarship

As I said above, scholarly research today is usually done in front of a computer. As sad as it may seem to a previous generation, many of us young scholars today believe (or act as if we believe) that if a piece of scholarship, or even a primary text, doesn’t exist in digital form then it doesn’t exist at all. This trend has been decried by various traditionally trained humanities scholars for over a decade now,[8] but it seems to me to be a descriptive reality. The “mercenary” quality I mentioned above has its own side-effects, but these are not contra-indicators, I would argue, of what is essentially the only way forward.

At the same time, it could be noted that, in the face of the decriers, the value of being able to download a PDF, legally or illegally, to help one’s research does not receive the level of praise in the educated media that it should. There really is a revolution going on. Much of modern scholarship is available in digital form if one is willing to look for it, to ask around for it, or (with relatively little effort) to digitize it oneself. This revolution deserves further publicity and can act as a balance against the traditionalists.

But, on the cognitive level, even if we were to overlook this obvious value, it is impossible, I would argue, to go back to the previous mode of scholarship. In previous generations, this is how research was done:

  1. the scholar would begin by buying a pack of 3x5 notecards and a little box,
  2. would then read through the sources while all the time writing relevant bits of information for his or her project on the notecards,
  3. would simultaneously or subsequently organize these cards within the little box,
  4. would then write magisterial books by moving methodically through these thousand bits of information in the little box toward the goal of the original idea.

The benefit of this approach was its combination of linearity and comprehensiveness. The mode of writing was linear, whereas the collection of information was non-linear and comprehensive. The scholar’s job was to make a narrative out of the disparately collected information. We can imagine that this basic process of research had not changed much since the beginning of written note-taking and continued up to the modern world in its basic ancient form.[9]

But the current shift to digital media has changed this process almost entirely. First, there is an absolute deluge of scholarly information, most of which is in PDF form (sometimes only available as PDF), but which also exists sometimes in plain text form on web sites that amass HTML texts.[10] Second, the thousands of PDFs I noted above are all available at the touch of a button, but how does one sort through them? The digital tools available, whether Zotero, EndNote, or whatever other bibliographic management tool, are inadequate in my experience in dealing with hundreds of GBs of information: they’re either too slow or don’t have a satisfactory mechanism for inputting and searching the data without false positives or errors. Moreover, despite their best efforts, these tools (which have only recently become hubs for searching online databases) can’t come close to uniting the world’s information networks under a single banner.

Perhaps most disturbing of all, Zotero, the favorite of the library and open-source communities, seems to have no native strategy for dealing with the proliferation of mobile devices.[11] I remember vividly that during the original Zotero Everywhere announcement in September 2010, one of the first questions asked in the chatroom was “will there be an iPhone/iPad version?”. The flat answer of “no” stood out like a sore thumb and made me feel at the time (and still today) that no one in charge of these digital research tools really understands how to integrate the current deluge of digital scholarly information with the way researchers (i.e. real people) want to interact with the information. It seems at times as if we’re just setting up more roadblocks on the way to solving the problem of scholarship in the digital age.

Disappointment with the tools aside, my larger fear is that this current generation of scholarship will be remembered by later generations as the era of what one might call “Asterisk Scholarship”. What I mean can be expressed in the following somewhat ungainly equation:

that, as the deluge of information increases at a very fast pace (including both the digitization of scholarly materials previously unavailable in digital form and the new production of journals and books in digital form), and as the tools that scholars use to sift, sort, and search this material are increasingly unable to keep up (either because they are limited in the sheer amount of data they can handle, or because they have become so complex that the average scholar can’t use them), the less likely it becomes that a scholar can adequately cover the research material and write a convincing scholarly narrative today.

In other words, I feel that the average young scholar today may have access to a tremendous amount of data — much more than any single scholar probably had access to in any previous generation — but that the same young scholar will almost certainly be unable to make use of that information to the level of sophistication that previous generations were able to achieve. This is partly due to the inadequate tools available, but also due to the lack of education regarding these tools and (possibly) due to the complicity of universities and libraries in the copyright blockades that are intrinsic to the academy today (see below).

The tangible result in the scholarship is that the number of citations and the size of bibliographies grow significantly, while (at the same time) it is not always possible for younger scholars to identify what is new or important amidst the deluge of scholarship. This is not just a problem with the deluge itself, but it is a problem with sorting the deluge. Presumably, if the tools for sorting, sifting, and searching were drastically better (unlikely in our lifetimes), then keen young minds might be able to identify the strands that are truly significant. We obviously do not want less information just because sorting it is difficult.

Thus, I would argue that in the future, when the computational tools (whatever they may be) eventually develop to a point of dealing profitably with the new deluge of digital scholarship, the backward-looking view of scholarship in our current transitional period may be generally disparaging. It may be so disparaging, in fact, that the scholarship of our generation will be seen as not trustworthy, or inherently compromised in some way by comparison with what came before (pre-digital) and what will come after (sophisticatedly digital).

Ultimately, as suggested above, this can be seen as a cognitive problem. There have been a number of studies lately, many written at a popular, accessible level, that couch the deluge of digital information in the terms of cognitive science or the history of cognition.[12] These are helpful for raising awareness of the problem, and for contextualizing the rate of growth of available information. The awareness of the problem, however, is not the same as a solution, and I do not believe a solution is readily at hand for scholarly research in the humanities, nor do I see the answer on the horizon. This is particularly true given the remarkably rapid mobilization of information-usage on phones and tablets, which is occurring at the very same time as we are witnessing the unprecedented growth in the availability of textual, scholarly information about the past.

V. Conclusions

To return to the beginning of this post, can I honestly say that there is any direct connection between the copyright/digital-library wars in academic publishing and the management of information overload that we are currently dealing with as researchers in the humanities? No, probably not. Or, if there is, the connection is so attenuated and indirect as to be impossible (for me, at least) to explain. So, why even mention the copyright situation? Perhaps to show precisely that these problems are occurring in mutually exclusive spheres, or at trajectories askew, and that the priorities of one sphere or trajectory are not exactly the priorities of the other. The last thing I want to do is assert a moral position on the copyright/digital-library wars — as I said above, I think the fact that the genie is out of the bottle is merely a description of the situation, rather than a prescription for future action. While it would be interesting to hear a sophisticated meditation on how the copyright/digital-library wars are potentially slowing the progress of scholarship in the humanities, that theme is too large for my own interests here.

In terms of the management of scholarly information, the question could be posed, do we really want to turn this over to a machine or an algorithm? No, I’m not advocating that as a solution, though my projection of the solution far into the future is partly an assumption that it will take a very “human” type of computer to help manage the information in a sophisticated way. For the moment one solution is to read less, but better. This may seem a Luddite approach to the problem, but what other choice is there? Too much information has come online or is coming online (particularly for historians) for anyone to be able to take account of it all. Furthermore, among humanities scholars it has always been helpful to feel that there’s another human “on the line” with you, so to speak, and that you’re not doing this work in a vacuum. So, despite the plethora of new opportunities for working at our computers and never venturing out of the office or library, perhaps my conclusion is that academic conferences and the “slow reading” practices of previous generations will necessarily continue for the foreseeable future. However, one might safely say that it’s inevitable that the global finding aids, as well as the sorting aids on our local machines, will get more involved in the process of choosing what’s important for the Humanities (writ large) to read and argue over.[13]

Update 1 (11.18.11):

I have made a few small changes to the text of this post, mainly correcting typos pointed out by readers (thank you!). Also, I have added a few websites hosting pdfs of scholarly books to the list. The new sites were brought to my attention by readers. Again, I offer this list as an argument about the proliferation of such sites and about the proliferation generally of pdfs of modern in-copyright books.

Update 2 (11.18.11):

David Weinberger, of the Berkman Center at Harvard, linked to this blog post shortly after it was written and responded by offering some thoughts on the growing body of information in the digital world. His basic point is that we should give up trying to keep up. Specifically, he writes the following in the form of a rhetorical question:

So, for me the question is what scholarship and expertise look like when they cannot attain a sense of mastery by artificial limiting the material with which they have to deal. It was much easier when you only had to read at the pace of the publishers. Now you’d have to read at the pace of the writers…and there are so many more writers! So, lacking a canon, how can there be experts? How can you be a scholar?

The “pace of the publishers” is an interesting concept that I will need to address in the future; I had not seriously considered before that this was a positive limiting factor rather than a negative one (see my response to Joel Kalvesmaki below). That sounds naive, but I had not considered it because scholars are always naturally hungry for the newest and the best publications. Perhaps the rhythm of scholarly publishing was a necessary element in the traditional process that is now thrown all out of whack (both the publishing and the digesting of publications) by the exponential growth of digital scholarship. However, we’re not quite reading at the “pace of the writers” yet, which is perhaps what provoked my consternation at the current state of affairs to begin with: in humanities scholarship, at least, we’re somewhere in the middle between the “pace of publishers” and the “pace of the writers”, which can be a herky-jerky kind of existence, always stopping and starting, rushing to keep up while also expecting publishers to be faster than they are.

(The question of canon and comprehensiveness — namely, whether it’s even worth trying to keep up — is worth a longer response in the future.)

  1. As with the music industry (which took so long to come around to the idea), maybe it’s actually in the presses’ interest for these PDFs to get leaked. But let’s assume for the moment it’s not and that the administration of the presses would like to keep these PDFs from leaking.

  2. Using both overhead- and photocopier-style scanners, incidentally (see below).

  3. There were nostalgic stories circulating among my fellow graduate students of the magisterial scholars in our field reading mimeographed copies of classic texts during train and bus rides between Oxford, Cambridge, and London. Such stories still had resonance at the time, and I even tried to imitate my heroes by reading photocopies of the same texts on the same trains!

  4. Though at the time no one was photographing books, so who would take notice?

  5. Though many people still do that for papyri, inscriptions, manuscripts, and rare books, in situations where scanners are either prohibited or impracticable.

  6. Expensive, yes, but nevertheless available at some universities and libraries from 2003 or so on.

  7. Which I only started seeing in university libraries from around 2008.

  8. See the reactionary review by Anthony Grafton of James O’Donnell’s incredibly prescient Avatars of the Word: From Papyrus to Cyberspace (Harvard, 1998). NB: This review is unfortunately behind the Project Muse paywall.

  9. Although I’m aware that a rich variety of practices within that basic framework appear throughout history: see the excellent new book by Ann Blair, Too Much to Know: Managing Scholarly Information before the Modern Age (Yale, 2010).

  10. In my field the great examples of the latter are Roger Pearse’s site and Paul Halsall’s Internet Medieval Sourcebook, but there are numerous others.

  11. Despite the developers’ best efforts, the “Zotero Everywhere” project (announced a year ago) is still in the early stages of the standalone beta: the standard non-beta release of Zotero is still a browser plugin (and for Firefox, a browser that is very much on the wane). This is not to slam Zotero, a group that has in general done a great job of maintaining communication with its users, but just to point out that the ship is turning very slowly. I note, incidentally, that the most recent Zotero blog post includes a summary of iOS and Android apps that interact with the Zotero APIs.

  12. I would recommend, above all, James Gleick’s The Information: A History, a Theory, a Flood.

  13. NB: This post is anecdotal and subjective. While I feel the need to write about my own experiences in this area, I do not pretend that this is a scientific analysis of the present or a fully informed prognostication of the future. Furthermore, it should be obvious that I do not speak for everyone involved in academic research, publishing, and library work. Despite the negative conclusion of this post, I am very grateful for the individual publishers, libraries, and institutions that have supported my own scholarly work.