Content is imported (ingested/catalogued) into DC-X using one of these methods:

  • Users or automated processes drop files and folders into „hotfolders“ monitored by DC-X.
  • Users manually upload files using the DC-X browser interface.
  • Users create new text documents in the web-based DC-X editor.
  • DC-X fetches remote data via HTTP in the form of RSS or Atom feeds.
  • External software pushes data into DC-X through its web service API.
  • Administrators import data into DC-X through its Unix command line tools.

The first two are (still) the most popular ways to get content into DC systems. Hotfolders are especially convenient because much of their behavior can be configured, including field values that should automatically be added to all files arriving in the hotfolder, and how related files (like XMP sidecar files) can be found. There’s a standard way of reading field values from subfolder names, and information can be extracted from file names, too.
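To make the hotfolder options more concrete, here is a minimal Python sketch of deriving field values from subfolder names and locating an XMP sidecar file. It is not DC-X code: the names `SUBFOLDER_FIELDS`, `DEFAULT_FIELDS`, `fields_for` and `sidecar_for`, the field names, and the "one subfolder level per field" convention are all assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Hypothetical convention: each subfolder level below the hotfolder root
# maps to one field name, in order.
SUBFOLDER_FIELDS = ["Provider", "Category"]

# Hypothetical defaults added to every file arriving in this hotfolder.
DEFAULT_FIELDS = {"Status": "new"}


def fields_for(hotfolder_root: str, file_path: str) -> dict[str, str]:
    """Derive field values for a file from its subfolder names."""
    relative = PurePosixPath(file_path).relative_to(hotfolder_root)
    subfolders = relative.parts[:-1]  # drop the file name itself
    fields = dict(DEFAULT_FIELDS)
    # zip() stops at the shorter sequence, so extra folder levels are ignored.
    fields.update(zip(SUBFOLDER_FIELDS, subfolders))
    return fields


def sidecar_for(file_path: str) -> str:
    """Guess the XMP sidecar path by swapping the file extension."""
    return str(PurePosixPath(file_path).with_suffix(".xmp"))
```

With these assumptions, a file dropped at `/hot/images/Reuters/Sports/photo.jpg` would receive the default `Status` plus `Provider` and `Category` values taken from its two parent folders.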

DC-X has a tiny embedded workflow engine that allows administrators to configure how new content is to be handled during import. Here’s how this works:

Multiple workflow definitions can be set up (and DC-X comes pre-installed with the most common ones), for example the „workflow for importing media files“, a „workflow for importing an RSS feed“ and a „workflow for importing news agency text articles“.

A workflow definition looks like a pipeline – it lists the steps to be performed during the workflow. Each step is a call to a piece of program code, with defined input and output parameters. This makes it possible to plug pre-defined functionality together as desired. (While a workflow is executed sequentially by default, it is possible to jump to specific steps and to call child workflows, allowing for more complex workflows.)
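The pipeline idea, including jumps and child workflows, can be sketched in a few lines of Python. This is only an illustration of the general technique, not the DC-X engine: `run_workflow`, the step functions and the shared `context` dictionary are hypothetical names, and passing data between steps through a single dictionary is an assumption.

```python
from typing import Callable, Optional

# A step receives the shared context and may return the name of the step
# to jump to; returning None continues with the next step in the list.
Step = Callable[[dict], Optional[str]]


def run_workflow(steps: list[tuple[str, Step]], context: dict) -> dict:
    """Run steps in order, honoring jumps requested by a step."""
    names = [name for name, _ in steps]
    index = 0
    while index < len(steps):
        name, step = steps[index]
        context.setdefault("trace", []).append(name)
        target = step(context)
        index = names.index(target) if target is not None else index + 1
    return context


def detect_type(ctx: dict) -> Optional[str]:
    # Jump past the image-only step for non-image files.
    return None if ctx["filename"].endswith(".jpg") else "store"


def make_preview(ctx: dict) -> None:
    ctx["preview"] = ctx["filename"] + ".preview"


def store(ctx: dict) -> None:
    ctx["stored"] = True


IMPORT_STEPS = [("detect", detect_type), ("preview", make_preview), ("store", store)]


def import_as_child(ctx: dict) -> None:
    # A step can call a child workflow by running another step list.
    run_workflow(IMPORT_STEPS, ctx)


PARENT_STEPS = [("import", import_as_child)]
```

Running `run_workflow(IMPORT_STEPS, {"filename": "a.txt"})` skips the preview step via the jump, while `run_workflow(PARENT_STEPS, {"filename": "a.jpg"})` executes the import steps as a child workflow.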

An example of an image file import workflow definition: The step „create medium-sized preview image“ would call an image processing function with 800 pixels as the desired size. The next step, „create thumbnail-sized preview image“, would call the same function with the size set to 400 pixels. A third step could read IPTC, XMP and EXIF metadata, and a fourth step would map that data into standard and custom fields using XSLT. The last step would finally import the input file, the preview files and the data into the DC-X database and filesystem.
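The image import example can be written down as a list of parameterized steps. The Python sketch below only mirrors the structure described above: every function is a stand-in that records what it would do, the names are hypothetical, and the XSLT mapping is replaced by a plain dictionary lookup to keep the example self-contained.

```python
from functools import partial


def create_preview(ctx: dict, *, size: int, key: str) -> None:
    # Stand-in for real image processing: only records the requested size.
    ctx["previews"][key] = f"{ctx['input']}@{size}px"


def read_metadata(ctx: dict) -> None:
    # Stand-in for an IPTC/XMP/EXIF reader.
    ctx["raw_metadata"] = {"iptc:Headline": "Example"}


def map_fields(ctx: dict) -> None:
    # The article describes XSLT for this mapping; a dict lookup stands in here.
    mapping = {"iptc:Headline": "Title"}
    ctx["fields"] = {
        mapping[k]: v for k, v in ctx["raw_metadata"].items() if k in mapping
    }


def import_document(ctx: dict) -> None:
    # Stand-in for writing files and data to the database and filesystem.
    ctx["imported"] = True


# The same preview function appears twice, configured with different sizes.
IMAGE_IMPORT_WORKFLOW = [
    partial(create_preview, size=800, key="medium"),
    partial(create_preview, size=400, key="thumbnail"),
    read_metadata,
    map_fields,
    import_document,
]


def run(input_file: str) -> dict:
    ctx = {"input": input_file, "previews": {}}
    for step in IMAGE_IMPORT_WORKFLOW:
        step(ctx)
    return ctx
```

The point of the sketch is that the two preview steps reuse one function and differ only in their parameters, which is what makes pre-defined functionality easy to recombine.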

When a new file is to be imported, that file is moved into the DC-X filesystem, and a job record is created in the database. The process monitoring the hotfolder does nothing else: It does not execute the job, so the file is not yet visible to the user in DC-X. Instead, one or multiple „worker processes“ running in the background pick up jobs and do the actual processing (this allows for parallel imports and load distribution among multiple servers).

The job record has quite a lot of metadata attached to it: Which file is to be acted upon (can also be multiple files or documents), which workflow definition to follow, when it was created, a priority value, whether a worker process already picked it up, whether it was processed successfully, and so on.
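The split between the hotfolder monitor (which only records a job) and the workers (which claim and process jobs) can be illustrated with a small job table. The following Python sketch uses SQLite purely for demonstration. The table layout, column names and functions are assumptions, not the actual DC-X schema, and a real multi-server setup would need an atomic claim mechanism provided by the database in use.

```python
import sqlite3
from typing import Optional

# Hypothetical job table covering the metadata listed above.
SCHEMA = """
CREATE TABLE job (
    id          INTEGER PRIMARY KEY,
    file_path   TEXT NOT NULL,
    workflow    TEXT NOT NULL,
    priority    INTEGER NOT NULL DEFAULT 0,
    created_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    claimed_by  TEXT,     -- NULL until a worker picks the job up
    succeeded   INTEGER   -- NULL while pending, then 0 or 1
)
"""


def enqueue(db: sqlite3.Connection, file_path: str, workflow: str,
            priority: int = 0) -> int:
    """What the hotfolder monitor does: record the job, nothing more."""
    with db:
        cur = db.execute(
            "INSERT INTO job (file_path, workflow, priority) VALUES (?, ?, ?)",
            (file_path, workflow, priority),
        )
    return cur.lastrowid


def claim_next(db: sqlite3.Connection, worker: str) -> Optional[int]:
    """What a worker does: take the highest-priority unclaimed job."""
    with db:
        row = db.execute(
            "SELECT id FROM job WHERE claimed_by IS NULL "
            "ORDER BY priority DESC, id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        db.execute("UPDATE job SET claimed_by = ? WHERE id = ?", (worker, row[0]))
    return row[0]


def finish(db: sqlite3.Connection, job_id: int, ok: bool) -> None:
    """Record whether the worker processed the job successfully."""
    with db:
        db.execute("UPDATE job SET succeeded = ? WHERE id = ?", (int(ok), job_id))
```

In this sketch, a job with a higher priority value is claimed first, and jobs with equal priority are claimed in the order they were created.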

A lot of information regarding imports and workflows can be monitored in the DC-X administration interface: Which processes are running, how many jobs are in the queue, which errors occurred. Processes can be started and stopped, hotfolders added or reconfigured, workflow definitions changed.

Workflows are useful for more than just importing content: Since jobs can be assigned to users, mixed human/machine workflows are possible. Example: A user can trigger an export workflow which automatically prepares files in specific formats and then assigns the job to another user who is to approve the export. After approval, the worker processes will once again pick up the job, transfer the files to the export destination and mark the documents as exported.
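The export example amounts to a small state machine in which a job alternates between machine and human ownership. Here is a minimal Python sketch under that assumption. The state names, the `ExportJob` class and the functions are hypothetical and do not reflect how DC-X stores or advances jobs internally.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ExportJob:
    documents: list[str]
    state: str = "prepare"             # prepare -> approval -> transfer -> done
    assigned_to: Optional[str] = None  # set while a human owns the job
    log: list[str] = field(default_factory=list)


def worker_tick(job: ExportJob, approver: str) -> None:
    """One pass of a worker: advance the job unless a human owns it."""
    if job.assigned_to is not None or job.state == "done":
        return
    if job.state == "prepare":
        job.log.append("prepared export files")
        job.state, job.assigned_to = "approval", approver
    elif job.state == "transfer":
        job.log.append("transferred files and marked documents as exported")
        job.state = "done"


def approve(job: ExportJob, user: str) -> None:
    """The human step: approving hands the job back to the workers."""
    if job.state != "approval" or job.assigned_to != user:
        raise PermissionError(f"{user} cannot approve this job")
    job.state, job.assigned_to = "transfer", None
```

While the job is assigned to a user, `worker_tick` leaves it alone. Only after `approve` clears the assignment does the next worker pass transfer the files and finish the job.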

Differences compared to DC5: Hotfolder and workflow configuration is now in the database and the admin interface, no longer in .ini files. The workflow engine is completely new, including job records and mixed human/machine workflows. Hotfolder monitoring and the actual import process have been separated. The admin interface has become much more powerful in this area.

Tim Strehle
About Tim Strehle

Tim was part of Digital Collections' Research & Development team from 1999 to 2017. He is an expert in metadata and thesauri.