Importing from XML / TEI

Source data

This example uses the TEI-encoded “Wizard of Oz” musical libretto from the New York Public Library as its source data. As in many TEI documents, the encoding is not entirely consistent, and so Feeds Tamper will be used to try to clean up some of the data. While this example uses an XML file, the same general process could be used with an HTML file.

This example illustrates how to extract songs from the libretto and save each as its own node.

Create a content type called “Song”, and add the following fields:

  • “Act”, text field with one value
  • “Scene”, text field with one value

In a real project, you would likely have “Scene” and “Act” content types: the “Scene” field on a song would be a node reference pointing to the correct “Scene” node, and the “Scene” content type would have its own “Act” node reference field. Since the goal of this example is to illustrate extracting data from an XML file using Feeds, plain text fields are enough, even though they are unrealistic for an actual project.

Creating the importer and basic configuration

Create a new Feeds importer by going to Structure > Feeds importers > Add importer. Name it “Song importer”, with the description “Imports songs from the Wizard of Oz TEI file.”

Under “Basic settings”, change “Periodic import” to “Off” and save.

Set the fetcher to “File upload” and the parser to “XML XPath parser”. If you don’t see this option, make sure the Feeds extensible parser module is installed and enabled.

Settings for XML XPath parser

The XML XPath parser settings page includes a table similar to the “Mapping” configuration screen. First, you must enter an XPath expression for the “context”. The context tells Feeds how to find each “thing” in the XML document that you want to import as a node. From there, you can add multiple “sources”, each with an XPath expression relative to the context. The data found by each source's XPath expression will be available on the “Mapping” configuration screen.

In this example, songs can be found at TEI/text/body/div/div/div[head] (that is, within the body of the TEI document, a <div> element nested inside an act <div> and a scene <div>, where that <div> element has a <head> element as a child). In the value field for the context, enter TEI/text/body/div/div/div[head]. Then, create the following sources with the following values:

  • Song title, head
  • Content, . (the value here should be a single period)
  • Scene, ../@n
  • Act, ../../@n
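
To see how the context and the relative sources fit together, here is a minimal Python sketch that runs the same lookups outside of Drupal. It is only an illustration, not how Feeds works internally. The sample XML, the song title in it, and the function name extract_songs are made up for this example, and the sample has no XML namespace, which a real TEI file normally would. Python's standard library cannot evaluate ../@n directly, so the sketch walks up through a parent map instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for the TEI file (no namespace).
SAMPLE = """<TEI><text><body>
  <div n="1">
    <div n="2">
      <div><head>"EXAMPLE SONG."</head><l>First line</l><l>Second line</l></div>
      <div><p>A passage with no head element, so the context skips it.</p></div>
    </div>
  </div>
</body></text></TEI>"""


def extract_songs(xml_text):
    """Return one dict per element matched by the context expression."""
    root = ET.fromstring(xml_text)  # root is the <TEI> element
    # ElementTree has no parent pointers, so build a child -> parent lookup.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    songs = []
    # Context: TEI/text/body/div/div/div[head], written relative to <TEI>.
    for song in root.findall("./text/body/div/div/div[head]"):
        scene = parent_of[song]   # equivalent of ..
        act = parent_of[scene]    # equivalent of ../..
        songs.append({
            "song_title": song.findtext("head"),   # source: head
            "content": "".join(song.itertext()),   # source: . (text only)
            "scene": scene.get("n"),               # source: ../@n
            "act": act.get("n"),                   # source: ../../@n
        })
    return songs


songs = extract_songs(SAMPLE)
```

Each dict in songs corresponds to one item that would be handed to the processor, with one key per source.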

Processor settings, mapping and debugging

The default node processor setting is correct. In the settings for the node processor, choose “Song” under “Bundle”. Under “Update existing nodes”, select “Replace existing nodes”. For the author, choose your own account.

Create the following mappings (source -> target, with any target configuration in parentheses):

  • Song title -> Title (“Used as unique”)
  • Scene -> Scene
  • Act -> Act
  • Content -> Body

At this point, it’s worth running the import by going to the import page, choosing “Song importer”, uploading the Wizard of Oz XML file, and clicking “Import”. Once it runs, you should see a notification “Created 13 nodes”.

Click on “Content” in the administration menu and look at the nodes that have been imported.

The node titles need to be cleaned up: they are all caps, some include one or more hyphens, some are in quotes, and some end with a period. Also, some of the items imported don’t include the actual lyrics to the song in the body field. These should be removed.

For those nodes that do have the actual song lyrics, the XML should be stripped out; this is the default configuration. At the same time, the easiest way to identify whether the text that gets imported into the body field is song lyrics or other text is to look for <l> (verse line) elements. Go back to editing the parser settings and add a new source, “Content with markup”, with the same XPath value as “Content”. For this new field, however, check the “raw” checkbox.

On the “Mapping” page, add a mapping from “Content with markup” to “Temporary target”.

Feeds Tamper

Song title -> Title

Add the following plugins to this mapping, in this order:

  • “Find replace”: insert quotation marks in the “Text to find” field, and leave the second text field blank.
  • “Convert case”: choose the default “Title Case” setting.
  • “Find replace” again: change the machine name so it doesn’t conflict with the instance of the same plugin you added earlier. Insert a hyphen in the “Text to find” field, and leave the second text field blank.
  • A third “Find replace”: change the machine name, and put a period in the “Text to find” field, leaving the second text field blank.
  • “Trim”: leave the text field blank, and leave the default “Both” setting.

This will strip the quotation marks, then convert the titles in all caps to a more reasonable format, with only the first letter of each word capitalized. Then, it will strip out the hyphen characters, then periods. Last, it will remove excess whitespace. If you don’t strip the quotation marks first, the “Convert case” plugin will treat the quotation marks as the first “letter” of the first word, and will fail to capitalize the actual first letter of the word.
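
The effect of this plugin chain, and why the order matters, can be sketched in Python. This is an approximation for illustration only: the function names are hypothetical, the title is made up, straight double quotes are assumed, and title_case_words is a simplified stand-in for the “Convert case” plugin that capitalizes the first character of each space-separated word.

```python
def title_case_words(text):
    """Simplified stand-in for "Convert case": uppercase the first character
    of each space-separated word and lowercase the rest."""
    return " ".join(word[:1].upper() + word[1:].lower() for word in text.split(" "))


def clean_title(raw):
    """Apply the five steps in the same order as the plugin chain."""
    text = raw.replace('"', "")     # Find replace: strip quotation marks
    text = title_case_words(text)   # Convert case: Title Case
    text = text.replace("-", "")    # Find replace: strip hyphens
    text = text.replace(".", "")    # Find replace: strip periods
    return text.strip()             # Trim: both sides


cleaned = clean_title('"EXAMPLE SONG."')

# Converting case before stripping the quotes leaves the first word lowercase,
# because the quotation mark occupies the first-character position.
wrong_order = title_case_words('"EXAMPLE SONG."')
```

Here cleaned is Example Song, while wrong_order still starts with a quotation mark followed by a lowercase "example".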

If you rerun the import at this point, you’ll notice that it creates numerous new nodes. This is because you’ve set the title field to be unique, and after you’ve stripped out the punctuation marks, the reformatted title no longer matches what you previously imported, and so Drupal won’t update the existing node.
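
The reason for the extra nodes can be modeled with a dictionary keyed on the unique field. This is a toy model under stated assumptions, not how Feeds stores or matches items; run_import and the titles are hypothetical.

```python
def run_import(store, titles):
    """Toy import: the title is the unique key, as in the mapping above.
    Returns (created, updated) counts."""
    created = updated = 0
    for title in titles:
        if title in store:
            updated += 1   # same unique value: the existing record is replaced
        else:
            created += 1   # unseen unique value: a new record is created
        store[title] = {"title": title}
    return created, updated


store = {}
first = run_import(store, ['"EXAMPLE SONG."'])   # before adding the Tamper plugins
second = run_import(store, ["Example Song"])     # after adding the Tamper plugins
```

Both runs report one created record and none updated, so store ends up holding two entries for the same song.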

Content with markup -> Temporary target one

Add the plugin “Keyword filter”. Under “Words or phrases to filter on”, enter <l> and save. This will exclude from the import any song that does not include line elements.
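
A rough Python equivalent of this filter shows why it has to run on the raw markup and not on the stripped body text. The items and function names below are invented for illustration, and the check is a plain substring match, which may differ from the plugin's exact matching options.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw items, as the "Content with markup" source might deliver them.
ITEMS = [
    "<div><head>EXAMPLE SONG</head><l>A sung line</l></div>",
    "<div><head>EXAMPLE DANCE</head><p>No lyrics here</p></div>",
]


def stripped_text(raw_markup):
    """Text with all tags removed, like the default (non-raw) "Content" source."""
    return "".join(ET.fromstring(raw_markup).itertext())


def keep_item(raw_markup, keyword="<l>"):
    """Keep an item only if the keyword appears somewhere in its raw markup."""
    return keyword in raw_markup


kept = [stripped_text(raw) for raw in ITEMS if keep_item(raw)]
```

Only the first item survives. A literal match on <l> would also miss line elements that carry attributes, so it is worth checking how the lines are encoded in your own file.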

Importing

If you have already done test imports, delete all the previously imported items first. Since you added a filter to weed out the songs without line elements, the previously imported versions of those songs will linger on your site if you don’t delete them. The easiest way is to click the “Delete items” tab on the import screen and choose “Delete”. This time when you run the import, it should create only 7 nodes.
