A typical DC-X system receives data from lots of different sources: news agencies, editorial systems, files in hotfolders, e-mails, RSS feeds. So there is a lot of code that deals with parsing various data formats and inserting content, metadata and files into DC-X.

In DC4, all of that code was written in PHP and did everything at once: reading input data, parsing and extracting metadata, creating a DC document object and inserting it into the database. We learnt later on that such tight coupling causes some headaches.

With DC5, we broke this process down into several reusable steps and tried to move everything to XML: If the input format wasn’t XML, we’d run some PHP code to translate it into XML first. The input XML was then converted to the DC5 XML format using XSLT, and in a last step the DC5-formatted XML was imported into the database.

This proved to be a good decision, so we’re doing the same thing in DC-X (with some fun improvements, of course). Only the burden of converting various formats into our own remained with us.

But times are changing. We’re very happy that customers and partners are starting to ask, "In which format should we deliver data to DC-X?" Offering a generic input format that others can code towards gives control to our customers (and takes a little work off our shoulders…).

DC-X has its own XML format that maps closely to its database structure, but it’s not in any way a standard format, so you as an implementor would be left alone with the (maybe poor) documentation we’re providing, and with a running DC-X installation as the only way to test your output.

Instead, we’d like to rely on an existing standard, and we think the Atom Syndication Format (RFC 4287) is a good choice: It is XML-based, simple, well-specified, extensible, and widely implemented. You can use the RSS reader of your choice to test your output. If it’s a valid Atom feed and looks fine in your RSS reader, you know it’s going to import well into DC-X.
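Before opening your output in a feed reader, a quick scripted check can catch the most basic problems. The following is only a rough sketch using Python's standard library, not a full Atom validator and not part of DC-X: the function name `check_atom` and the list of elements it looks for are our own assumptions for illustration. RFC 4287 requires `id`, `title` and `updated` in every entry; DC-X itself may be more lenient, so adjust the list to taste.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
# Elements RFC 4287 requires in every entry (assumption: you want strict checks)
REQUIRED = ("id", "title", "updated")


def check_atom(xml_text):
    """Return a list of problems; an empty list means the basic checks passed."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return [f"not well-formed XML: {exc}"]

    # Accept either a single entry document or a feed containing entries
    if root.tag == f"{{{ATOM}}}entry":
        entries = [root]
    elif root.tag == f"{{{ATOM}}}feed":
        entries = root.findall(f"{{{ATOM}}}entry")
    else:
        return [f"root element is {root.tag}, expected an Atom feed or entry"]

    problems = []
    for index, entry in enumerate(entries, start=1):
        for name in REQUIRED:
            if entry.find(f"{{{ATOM}}}{name}") is None:
                problems.append(f"entry {index}: missing <{name}>")
    return problems
```

Calling `check_atom(open("entry.xml", encoding="utf-8").read())` on a file of your own (the file name is just an example) returns an empty list if nothing obvious is wrong.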

You can either make your data available as an Atom feed that DC-X can fetch over HTTP, or you can put a file containing the feed XML or a single entry into a DC-X hotfolder. (DC-X also supports the Atom Publishing Protocol, RFC 5023, for creating DC-X documents by sending the same format via an HTTP POST.) Images and other files to be imported are referenced with the standardized link rel="enclosure" construct (except when you are using the Atom Publishing Protocol). Special DC-X metadata can be embedded using the DC-X XML namespace.

Here’s an example (an image file with some metadata):

 <?xml version="1.0" encoding="UTF-8"?>
 <entry xmlns="http://www.w3.org/2005/Atom">
   <!-- ID (optional) -->
   <id>my-doc-5p3svhdupvolejj7efw</id>
   <!-- Reference to the associated file (file or HTTP URL, optional) -->
   <link rel="enclosure" href="file://filename.jpg" type="image/jpeg"/>
   <!-- Creation date (optional) -->
   <updated>2009-05-06T09:39:37+02:00</updated>
   <!-- Author (optional) -->
   <author>
     <name>John Doe</name>
     <email>john.doe@example.com</email>
   </author>
   <!-- Title -->
   <title type="text">The remains of a car bomb are seen at the site 
of a bomb attack in Baghdad</title>
   <!-- Text as XHTML, always embedded in a <div> element -->
   <content type="xhtml">
     <div xmlns="http://www.w3.org/1999/xhtml">
       <div>The remains of a car bomb are seen at the site of a bomb attack in
Baghdad May 6, 2009. A vehicle bomb killed 10 people and <b>wounded</b> 37 others
on Wednesday when it exploded in a wholesale vegetable market in southern
Baghdad, police said.  REUTERS/Ahmed Malik (IRAQ CONFLICT POLITICS)</div>
     </div>
   </content>
   <!-- Additional metadata in the native DC-X XML format (optional) -->
   <document xmlns="http://www.digicol.com/xmlns/dcx" version="1.0">
     <head>
       <Country>Iraq</Country>
       <Provider>REUTERS</Provider>
       <City>Baghdad</City>
       <Keywords>:rel:d:bm:GF2E5560KT301</Keywords>
       <Keywords>War</Keywords>
       <Person>Ahmed Malik</Person>
     </head>
   </document>
 </entry>
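
If you generate entries from code rather than by hand, an XML library takes care of escaping and namespaces for you. Here is a minimal sketch in Python that builds an entry shaped like the example above. The helper `build_entry` and its parameters are hypothetical names chosen for illustration, not a DC-X API. Note that the output uses `xhtml:` and `dcx:` prefixes instead of nested default namespaces; that is equivalent for any namespace-aware parser, but we have not shown here that every importer treats it identically.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
XHTML = "http://www.w3.org/1999/xhtml"
DCX = "http://www.digicol.com/xmlns/dcx"

# Atom becomes the default namespace; the other two get readable prefixes
ET.register_namespace("", ATOM)
ET.register_namespace("xhtml", XHTML)
ET.register_namespace("dcx", DCX)


def build_entry(entry_id, updated, title, body, href, mime_type, head_fields):
    """Build an Atom entry; head_fields is a list of (field name, value) pairs."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}id").text = entry_id
    # The file to import is referenced as an enclosure link
    ET.SubElement(entry, f"{{{ATOM}}}link", rel="enclosure", href=href, type=mime_type)
    ET.SubElement(entry, f"{{{ATOM}}}updated").text = updated
    ET.SubElement(entry, f"{{{ATOM}}}title", type="text").text = title
    # XHTML content is wrapped in a single <div>, as Atom requires
    content = ET.SubElement(entry, f"{{{ATOM}}}content", type="xhtml")
    ET.SubElement(content, f"{{{XHTML}}}div").text = body
    # Native DC-X metadata lives in its own namespace
    document = ET.SubElement(entry, f"{{{DCX}}}document", version="1.0")
    head = ET.SubElement(document, f"{{{DCX}}}head")
    for name, value in head_fields:
        ET.SubElement(head, f"{{{DCX}}}{name}").text = value
    return entry


entry = build_entry(
    entry_id="my-doc-5p3svhdupvolejj7efw",
    updated="2009-05-06T09:39:37+02:00",
    title="The remains of a car bomb are seen at the site of a bomb attack in Baghdad",
    body="A vehicle bomb killed 10 people & wounded 37 others, police said.",
    href="file://filename.jpg",
    mime_type="image/jpeg",
    head_fields=[("Country", "Iraq"), ("Provider", "REUTERS"),
                 ("Keywords", ":rel:d:bm:GF2E5560KT301"), ("Keywords", "War")],
)
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

The resulting string can be written to a file and dropped into a hotfolder, or served as part of a feed.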

What do you think?
