Documentation about the MODS-based .txt datasets
In addition to the MODS catalogue records and TEI text transcripts for each journal, we’re now making available three .txt-file datasets for each journal we post on the MJP Lab's Sourceforge repository. These datasets are derived from the MODS files, aggregating most of the data recorded there; they thus give users the MJP’s catalogue information about each journal without the hassle of having to extract, concatenate, and organize the data themselves.
Why there are three datasets—and what appears in each one
We’re making three .txt datasets available for each journal that appears on the Sourceforge repository, since each dataset configures the MODS data in a different way and allows for different kinds of analyses. Here, for instance, are the three datasets we've uploaded for Poetry Magazine:
The first dataset above offers a quick overview of Poetry by listing issue-level information only, while the other two datasets offer more detailed views of the journal by including item-level information along with the issue-level data from the first dataset. Though the last two datasets contain much of the same data, the second one offers an exhaustive record of every title in Poetry, while the third provides an exhaustive record of every contributor to the magazine.
Dataset 1: Journal overview
The "journal overview" dataset contains only general information about each issue of a magazine, drawn from the top section of the MODS file for every issue. By excluding all data about the contents of individual issues (e.g., articles, letters, pictures), this is by far the smallest of the three datasets and affords users a quick overview of the journal as a whole. The “journal overview” dataset provides information (if recorded in the MODS files) for each issue in the following twelve categories:
- journal title: This is the primary title of the journal: e.g., “Egoist,” “Little Review.” Please note that we’ve omitted the preceding article (e.g., The), if it exists, from the main title of the journal, in order to facilitate meaningful sorting—so “The Egoist” simply appears as “Egoist”. The journal title information corresponds to the content of the mods:title element in the MODS file at this xpath: //mods:mods/mods:titleInfo/mods:title
- journal subtitle: This is the secondary title that may appear on the cover or contents page of a magazine: e.g., “An Individualist Review” for some issues of The Egoist. This info corresponds to the content of the mods:subTitle element in the MODS file at this xpath: //mods:mods/mods:titleInfo/mods:subTitle
- issue name: In addition to a subtitle, some magazine issues carry a specific name (usually on the cover) that conveys information about the issue's special theme or contents: e.g., issue 2.5 of The Egoist is designated the “Special Imagist Number,” while issue 5.4 of The Little Review is the “Henry James Number.” This info corresponds to the content of the mods:partName element in the MODS file at this xpath: //mods:mods/mods:titleInfo/mods:partName
- volume: This is the number of the volume that an issue belongs to: e.g., Freewoman 2.28, Poetry 21.2. We’ve extracted this info from the mods:partNumber element in the MODS file at this xpath: //mods:mods/mods:titleInfo/mods:partNumber
- issue: This is the number of the issue within a certain volume: e.g., Freewoman 2.28, Poetry 21.2. We’ve also extracted this info from the mods:partNumber element in the MODS file at this xpath: //mods:mods/mods:titleInfo/mods:partNumber
- date: This is the issue’s date of publication, expressed as an eight-digit number: YEAR-MO-DY. The date of the February 8, 1912 issue of the Freewoman thus appears this way in the dataset: 1912-02-08. We’ve extracted this info from the mods:dateIssued element in the MODS file at this xpath: //mods:mods/mods:originInfo/mods:dateIssued[@keyDate="yes"]. When we processed the dataset, we converted the six-digit dates of monthlies and quarterlies, as well as the four-digit dates of annuals, to this eight-digit form, so that all MJP journals can be compared by their publication dates.
- journal editor: This is the primary editor of the journal, listed by “last name, first name”: e.g., “Monroe, Harriet” of Poetry. If a journal has two main editors, both are listed here, in the order of their appearance in the MODS file: e.g., “Marsden, Dora; Gawthorpe, Mary” of The Freewoman. This info corresponds to the content of the mods:namePart element in the MODS file at this xpath: //mods:mods/mods:name/mods:namePart
- publisher: The journal’s publisher can be either a company (Stephen Swift and Co. Ltd.) or an individual (Harriet Monroe); in both cases, the publisher’s name is written exactly as it appears in the publication (for persons, that's generally: first name last name). This info corresponds to the content of the mods:publisher element in the MODS file at this xpath: //mods:mods/mods:originInfo/mods:publisher
- journal location: This is where the magazine was published—usually a major city: e.g., London, New York, Chicago. This corresponds to the content of the mods:placeTerm element at this xpath in the MODS file: //mods:mods/mods:originInfo/mods:place/mods:placeTerm[@type="text"]
- issue length (pp): This is the total number of pages in any issue of the magazine (rather than merely the issue’s numbered pages), including both front and back covers and all advertising pages. We’ve extracted this info from the mods:extent element in the MODS file at this xpath: //mods:mods/mods:physicalDescription/mods:extent
- issue height (cm): This is the physical height of the issue (measured in centimeters). We’ve also extracted this info from the mods:extent element in the MODS file at this xpath: //mods:mods/mods:physicalDescription/mods:extent
- issue width (cm): This is the physical width of the issue (measured in centimeters). We’ve also extracted this info from the mods:extent element in the MODS file at this xpath: //mods:mods/mods:physicalDescription/mods:extent
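As a rough illustration of how the xpaths above map onto a MODS file, here is a minimal Python sketch that pulls a few of the issue-level fields out of a record using the standard library. The sample record is hypothetical and heavily trimmed, the helper names (text_at, normalize_date, issue_overview) are our own inventions, and the date padding is an assumption: the documentation says shorter dates were converted to the eight-digit form but does not say how, so the sketch simply fills any missing month or day with 01.

```python
import xml.etree.ElementTree as ET

# The MODS namespace published by the Library of Congress.
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hypothetical, heavily trimmed record for illustration only.
SAMPLE_MODS = """\
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:titleInfo>
    <mods:title>Freewoman</mods:title>
  </mods:titleInfo>
  <mods:originInfo>
    <mods:place>
      <mods:placeTerm type="text">London</mods:placeTerm>
    </mods:place>
    <mods:publisher>Stephen Swift and Co. Ltd.</mods:publisher>
    <mods:dateIssued keyDate="yes">1912-02</mods:dateIssued>
  </mods:originInfo>
</mods:mods>
"""


def text_at(root, path):
    """Return the stripped text of the first element matching path, or ''."""
    node = root.find(path, MODS_NS)
    return (node.text or "").strip() if node is not None else ""


def normalize_date(raw):
    """Pad YYYY or YYYY-MM to YYYY-MM-DD.

    Assumption: missing month/day parts become 01. The real conversion
    rule used for the datasets is not documented here.
    """
    if not raw:
        return ""
    parts = raw.split("-")
    while len(parts) < 3:
        parts.append("01")
    return "-".join(parts)


def issue_overview(xml_text):
    """Extract a few 'journal overview' fields from one MODS record."""
    root = ET.fromstring(xml_text)  # root is the mods:mods element itself
    return {
        "journal title": text_at(root, "mods:titleInfo/mods:title"),
        "publisher": text_at(root, "mods:originInfo/mods:publisher"),
        "journal location": text_at(
            root, "mods:originInfo/mods:place/mods:placeTerm[@type='text']"
        ),
        "date": normalize_date(
            text_at(root, "mods:originInfo/mods:dateIssued[@keyDate='yes']")
        ),
    }


overview = issue_overview(SAMPLE_MODS)
```

Because the parsed root is already the mods:mods element, the paths in the sketch are written relative to it rather than with the leading //mods:mods/ used in the field descriptions.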
Datasets 2 and 3: item-level data for “every title” and “every contributor”
The two remaining .txt datasets contain all of the information about a magazine that appears in the twelve data fields above, but they additionally include information about the individual items within the journal. Each dataset is moreover configured so every item listed in it includes (in the same row or string) all of the information from the first dataset about the issue in which it was published. Thus, both of the item-level datasets for Poetry record Ezra Pound’s “In a Station of the Metro” as being a poem that appears on page 12 of issue 2.1 of the magazine, but they also associate the poem with that issue’s date, its physical size, its overall length, who edited and published it, where the issue was published, and whether the issue carried a subtitle and issue name.
While the info in the two item-level datasets is largely the same, the second “every title” dataset offers the most complete and accurate account of every item published in the magazine, while the third “every contributor” dataset gives the most complete and accurate account of every contributor to the journal. We felt we needed to create these two versions of the item-level data since there often isn't a one-to-one correlation between contributors and items/titles in a magazine: some items (like ads, contents pages, etc.) lack authors, others have more than one author, while still others have an editor and/or translator(s) in addition to their author(s).
The “every contributor” dataset, therefore, lists together, in a single “contributor” data field, every author, artist, editor, and translator who made a contribution to the magazine, giving one mention per contribution. This means that items created by multiple contributors are mentioned multiple times in this dataset, while items not associated with a contributor are not represented at all.
The “every title” dataset, by contrast, mentions every item/title in the magazine just once. If an item has multiple authors, they will be listed together, in a single cell for that item, in the “creator” list (which means that you will only be able to sort by the first author listed, which corresponds to the first author who appears in the MODS record for that item). This dataset, however, places the “creators” (authors and artists), “editors,” and “translators” of items in three separate data fields, which makes this dataset additionally useful for discerning these different contributions to the magazine.
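The difference between the two item-level datasets can be sketched in a few lines of Python. Everything here is illustrative: the item records, the dictionary keys, and the function names are hypothetical, and the order in which roles are merged into the single "contributor" field (creators, then editors, then translators) is an assumption rather than something the datasets guarantee.

```python
# Hypothetical catalogue items; names and titles are placeholders.
items = [
    {"title": "Poem One", "creators": ["Doe, Jane"],
     "editors": [], "translators": []},
    {"title": "Poem Two", "creators": ["Roe, Richard"],
     "editors": [], "translators": ["Ayscough, Florence", "Lowell, Amy"]},
    {"title": "Advertisement", "creators": [],
     "editors": [], "translators": []},
]


def every_title_rows(items):
    """One row per item; multiple names share a cell, joined by '; '."""
    return [
        {
            "title": item["title"],
            "creator": "; ".join(item["creators"]),
            "editor": "; ".join(item["editors"]),
            "translator": "; ".join(item["translators"]),
        }
        for item in items
    ]


def every_contributor_rows(items):
    """One row per contribution; items with no contributor are dropped."""
    rows = []
    for item in items:
        # Assumed merge order: creators, editors, translators.
        names = item["creators"] + item["editors"] + item["translators"]
        for name in names:
            rows.append({"contributor": name, "title": item["title"]})
    return rows


title_rows = every_title_rows(items)
contributor_rows = every_contributor_rows(items)
```

In this sketch the three items yield three "every title" rows but four "every contributor" rows: "Poem Two" is repeated once for each of its three contributors, and the unsigned advertisement disappears.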
What follows is a detailed account of the seven data fields in these two datasets that don't also appear in the "journal overview" dataset:
- creator (in the “every title” dataset only): This field includes any author or artist who has created an item that appears in the magazine. The info listed here corresponds to the contents of the mods:namePart element when the xpath in the MODS file is //mods:mods/mods:relatedItem/mods:name/mods:namePart and the content of the associated mods:roleTerm is “creator”.
- editor (in the “every title” dataset only): Not to be confused with the “journal editor” or any other editor on the staff of the magazine, the “editor” field refers only to those contributors to the magazine who are identified as having edited a particular item within it. This happens rarely, so it’s possible that the entire "editor" column in the dataset for a journal will appear empty. The info listed here corresponds to the contents of the mods:namePart element when the xpath in the MODS file is //mods:mods/mods:relatedItem/mods:name/mods:namePart and the content of the associated mods:roleTerm is “editor”.
- translator (in the “every title” dataset only): The info listed here corresponds to the contents of the mods:namePart element when the xpath in the MODS file is //mods:mods/mods:relatedItem/mods:name/mods:namePart and the content of the associated mods:roleTerm is “translator”. When there are two or more translators for an item, they will appear together, in the single “translator” cell for that item, in the order of their appearance in the MODS record: e.g., Florence Ayscough and Amy Lowell, the two translators of Wen Cheng-ming’s “Chinese Written Wall Pictures” in Poetry 13.5, appear in the dataset as “Ayscough, Florence; Lowell, Amy”.
- contributor (in the “every contributor” dataset only): This field combines, in a single field/column, all “creators,” “editors,” and “translators” of items in a magazine. As we mentioned above, authors, artists, editors, and translators are listed in this field every time they contribute (or participate in contributing) an item to the magazine. The info in this field corresponds to the contents of the mods:namePart element at this xpath in the MODS file: //mods:mods/mods:relatedItem/mods:name/mods:namePart
- title (in both item-level datasets): The “title” field records the names of individual items published in the magazine. This information appears in the MODS file within the mods:titleInfo element for an item at the following xpath: //mods:mods/mods:relatedItem/mods:titleInfo. The title info may include the contents of the following five elements: 1. mods:nonSort: an initial article that precedes the main title; 2. mods:title: the item's primary title; 3. mods:subTitle: any secondary title for the item; 4. mods:partNumber: the number assigned to the item if it's part of a larger series; and 5. mods:partName: the name assigned to the item if it's part of a larger series. When combined within the “title” field for an item, these five elements will always be arranged in the following order with the following punctuation: article + title: subTitle—partNumber: partName. For instance: The Reader Critic: 'Spiritual Adventures'; or, Songs and Sketches—I: Night.
- genre (in both item-level datasets): When the MJP staff create the MODS records for a magazine, they assign a genre to each magazine item they catalogue. The field of genres available to the cataloguer is currently limited to the following seven kinds of texts: advertisements, articles, drama, fiction, images, letters, and poetry; accordingly, each item in the item datasets will be associated with one of these seven genres. The info in the genre field corresponds to the content of the mods:genre element in the MODS file at this xpath: //mods:mods/mods:relatedItem/mods:genre
- pages (in both item-level datasets): This field indicates the location of the item within the magazine. The page location is always expressed as a span of pages (e.g., 10-25), even if the item appears on only a single page (e.g., 8-8). If the item is interrupted within the magazine, multiple page spans will be listed here: e.g., 6-9 14-15. These page numbers reflect the pagination that appears in the published magazine (along with the MJP cataloguer’s pagination of any unnumbered pages), rather than the absolute count of pages (from cover to cover) that appears in the page images view of the magazine on the MJP website. This page information corresponds with the content of the mods:start and mods:end elements at this xpath in the MODS file: //mods:mods/mods:relatedItem/mods:part/mods:extent
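Two of the fields above follow fixed patterns that are easy to work with programmatically. The Python sketch below shows one way the "title" cell could be assembled from its five parts, and one way a "pages" cell could be read back into page spans. Both functions and their names are hypothetical. How the punctuation behaves when an element is missing is an assumption, as is the idea that page numbers are always Arabic numerals (the cataloguer's pagination of unnumbered pages may not be).

```python
def build_title(title, non_sort="", sub_title="", part_number="", part_name=""):
    """Combine title parts as: article + title: subTitle(dash)partNumber: partName.

    The separator before the part information is an em dash, written here
    as the escape \\u2014. Empty parts are skipped (an assumption).
    """
    main = " ".join(p for p in (non_sort, title) if p)
    if sub_title:
        main = f"{main}: {sub_title}"
    part = ": ".join(p for p in (part_number, part_name) if p)
    return f"{main}\u2014{part}" if part else main


def parse_page_spans(cell):
    """Turn a pages cell such as '6-9 14-15' into [(6, 9), (14, 15)].

    Assumes every page number is an integer; int() raises ValueError
    for anything else.
    """
    spans = []
    for chunk in cell.split():
        start, end = chunk.split("-", 1)
        spans.append((int(start), int(end)))
    return spans


def page_count(cell):
    """Count the pages covered by all spans in a pages cell."""
    return sum(end - start + 1 for start, end in parse_page_spans(cell))


series_title = build_title("Songs and Sketches", part_number="I", part_name="Night")
interrupted = parse_page_spans("6-9 14-15")
```

A single-page item recorded as 8-8 counts as one page under this scheme, and the interrupted item 6-9 14-15 counts as six.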
Additional Features of the Datasets
How the datasets are organized
Each journal dataset pulls together the information from all of the MODS files that the MJP has created for that journal and arranges it by publication date, so the information for the earliest issue of the magazine appears at the top of the dataset and the info for the last issue appears at the bottom. Within each issue, the items are in turn ordered exactly as they appear in the source MODS file: by their order of appearance, page by page, within the covers of the magazine.
When the MJP staff catalogue the contents of a magazine, they occasionally decide to embed some items in the MODS file within other items. This is generally done for items that contain many small parts (like the incidental illustrations that accompany some texts) that we'd rather not show up in the contents pages for the magazine but do want to show up in searches for the journal. In creating the item datasets from the MODS files, we decided to include all items, both top-level and embedded, so they now appear on the same level. In the case of poem series, this may cause some redundancy in page counts, since the same material may appear twice: first under the one encompassing title, and then as the sum of its component parts.
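The flattening of embedded items can be pictured with the short Python sketch below. It assumes that an embedded item is a mods:relatedItem nested inside another mods:relatedItem. The sample record and the function name flatten_items are hypothetical. The sketch also keeps each item's nesting depth, which the datasets themselves do not record, simply to make the flattening visible.

```python
import xml.etree.ElementTree as ET

NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hypothetical record: a poem series with two embedded parts,
# followed by a standalone item.
SAMPLE = """\
<mods:mods xmlns:mods="http://www.loc.gov/mods/v3">
  <mods:relatedItem>
    <mods:titleInfo><mods:title>Series Title</mods:title></mods:titleInfo>
    <mods:relatedItem>
      <mods:titleInfo><mods:title>Part One</mods:title></mods:titleInfo>
    </mods:relatedItem>
    <mods:relatedItem>
      <mods:titleInfo><mods:title>Part Two</mods:title></mods:titleInfo>
    </mods:relatedItem>
  </mods:relatedItem>
  <mods:relatedItem>
    <mods:titleInfo><mods:title>Standalone Poem</mods:title></mods:titleInfo>
  </mods:relatedItem>
</mods:mods>
"""


def flatten_items(parent, depth=0):
    """Yield (depth, title) for every relatedItem, in document order."""
    for item in parent.findall("mods:relatedItem", NS):
        # Only the item's own titleInfo is consulted, not its children's.
        title = item.findtext("mods:titleInfo/mods:title", default="", namespaces=NS)
        yield depth, title.strip()
        # Recurse so embedded items follow directly after their parent.
        yield from flatten_items(item, depth + 1)


flat = list(flatten_items(ET.fromstring(SAMPLE)))
```

Dropping the depth value from each pair gives the single-level listing described above, in which a series title and its component parts sit side by side.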
Things we've left out of the datasets
Besides the article that precedes a journal's title (see above), we've decided to exclude from the datasets any notes or tablesOfContents appended to the MODS record for an item, since these are often discursive accounts of mostly incidental matters, made at the discretion of the cataloguer, that don't add much to the information about an item.
More about Names
Except in the case of publishers, the names of people in the various data fields (contributor, creator, editor, translator, journal_editor) will appear as "last name, first name" (e.g., Rodker, John; Davies, Mary Carolyn; Eliot, T. S.), which facilitates meaningful sorting by last name. We also include, in parentheses after the person's name, any term of address that has been recorded for an individual in the MODS file: e.g., Wilson, Edmund (Jr.); Van Rennselaer, Schuyler (Mrs.); von Freytag-Loringhoven, Else (Baroness). Because the datasets record names, not persons, the same person may show up in these datasets under several name variants, abbreviations, or aliases. (The MJP policy is to record the name as it appears in the magazine, so any variation or mistake in the spelling of a person's name in the magazines will also show up in our data about it.)
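For users who want to take name cells apart again, the following Python sketch splits a multi-name cell on its semicolons and separates each name into last name, first name, and term of address. The function names and the regular expression are our own, and the sketch assumes that names never contain a semicolon and that a term of address, when present, is the final parenthesized part of the name. A name with no comma is returned whole as the last name.

```python
import re

# last name, first name, optional "(term of address)" at the end.
NAME_RE = re.compile(
    r"^(?P<last>[^,]+),\s*(?P<first>.*?)(?:\s*\((?P<term>[^)]*)\))?$"
)


def split_names(cell):
    """Split a cell like 'Ayscough, Florence; Lowell, Amy' into names."""
    return [name.strip() for name in cell.split(";") if name.strip()]


def parse_name(name):
    """Return a dict with 'last', 'first', and 'term' for one name."""
    match = NAME_RE.match(name.strip())
    if match is None:
        # No comma: treat the whole string as a single-part name.
        return {"last": name.strip(), "first": "", "term": ""}
    return {
        "last": match.group("last").strip(),
        "first": match.group("first").strip(),
        "term": match.group("term") or "",
    }


translators = [parse_name(n) for n in split_names("Ayscough, Florence; Lowell, Amy")]
junior = parse_name("Wilson, Edmund (Jr.)")
```

Since the datasets record names rather than persons, parsing of this kind does not merge variants: two spellings of the same contributor will still come out as two distinct names.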