README

Working title: Metadata Hub (mdhub).

Loose objective: Import, transform, and manage metadata coming from different sources such as the SWB (Südwestdeutscher Bibliotheksverbund, the South-West German Library Network), CSV files, XML files, and the like.

An ASCII schema:

                     +--------+
                     | LIBERO |
                     +---+----+
                         |
                         v
+-------------+      +-------+      +-------------+      +-------+
| XML         +----->|       |      |             |      |       |
+-------------+      |       |      |             |      |       |
+-------------+      |       |      |             |      |       |
| CSV         +----->| mdhub +----->| Metadata DB +----->| Index |
+-------------+      |       |      |             |      |       |
+-------------+      |       |      |             |      |       |
| SWB/Marc 21 +----->|       |      |             |      |       |
+-------------+      +-------+      +-------------+      +-------+
                         ^                                   ^
+-------------+          |                                   |
| ...         +----------+          +-----------+            |
+-------------+                     | Beanshell +------------+
                                    +-----------+

The application should be a command-line tool. It should be highly configurable via XML or some other format.

Logs should be accessible and searchable via a browser frontend.

Basic pymarc API usage

See: https://gist.github.com/1319893

Specs

A sample XML configuration for processing.

<datasource type="marc" location="/home/net/daily/TA*mrc" db_specs="...">
    <proc_marc copyall_fields="false">
        <rule in_field="001" out_field="003">
            <process>
                <iconv from="latin-1" to="utf-8"></iconv>
                <regex>s/(.*)/prefix_$1/</regex>
            </process>
        </rule>
    </proc_marc>
</datasource>

1) A datasource could be CSV, XML, MARC, ...

Question: Do we need to specify what we want to do with a source once we know what it is (e.g. CSV)?

Typical transformations on fields: iconv, regex, prefix, ...

Transformations only work on single fields (a single MARC tag, a single CSV column, a single element or attribute).
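Such single-field transformations compose naturally as a pipeline of value-to-value functions. A minimal sketch (the function and rule names here are hypothetical, not part of mdhub):

```python
import re

# Hypothetical sketch: each transformation is a str -> str function on a
# single field value (one MARC tag, one CSV column); a rule chains them.

def make_regex(pattern: str, replacement: str):
    compiled = re.compile(pattern)
    # count=1: apply once, like s/.../.../  without the /g flag
    return lambda value: compiled.sub(replacement, value, count=1)

def make_prefix(prefix: str):
    return lambda value: prefix + value

def apply_rule(value, steps):
    for step in steps:
        value = step(value)
    return value

# e.g. a rule equivalent to the <regex> step in the sample configuration,
# prefixing the whole field value:
steps = [make_regex(r"(.*)", r"prefix_\1")]
print(apply_rule("BV123", steps))  # prefix_BV123
```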

Possible routes

  • PyXB (automatic generation of python classes from XSD)
  • generateDS (generate data structures from XSD)
  • lxml (objectify)

Let's try lxml first.
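As a first shape test, the sample configuration already parses with the stdlib's xml.etree.ElementTree; lxml's objectify adds attribute-style access on top of the same structure:

```python
import xml.etree.ElementTree as ET

# The sample <datasource> configuration from above, inlined for the sketch.
CONFIG = """
<datasource type="marc" location="/home/net/daily/TA*mrc" db_specs="...">
    <proc_marc copyall_fields="false">
        <rule in_field="001" out_field="003">
            <process>
                <iconv from="latin-1" to="utf-8"></iconv>
                <regex>s/(.*)/prefix_$1/</regex>
            </process>
        </rule>
    </proc_marc>
</datasource>
"""

root = ET.fromstring(CONFIG)
print(root.get("type"))  # marc
for rule in root.iter("rule"):
    print(rule.get("in_field"), "->", rule.get("out_field"))
    for step in rule.find("process"):
        print(" ", step.tag, step.attrib, step.text or "")
```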

Question: Is our DB schema just a relational version of our enhanced MARC record?

Processing Rules

  • On file sets

    • Maybe some sanity checks
  • Whole file processing rules

    • change encoding
  • [MARC] Per field processing rules

    • prefix, suffix, infix (maps a tag value onto the same tag)
    • copy one tag value to another tag (e.g. if no unique ID is found in the default place)
    • some arbitrary action
  • [CSV] Per column processing rules

Question: Since we can have arbitrary rules (e.g. fetch a value from some source and, on a match, insert it into a specified MARC tag), maybe we should look into a build automation system, where we can leverage a lot of filesystem operations and extend the system with our own specific tasks. Candidate: rake

Database

The processing software is not persistent; it needs to talk to the DB. We can use a key-value store for quick tests, an ORM (MySQL, Postgres, SQLite3) for intermediate use, and finally raw SQL for speed once we have decided which DB to use.
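For the quick-test stage, the stdlib already suffices: a sketch using an in-memory SQLite table as a key-value store (the schema and record ID here are hypothetical):

```python
import json
import sqlite3

# Quick-test sketch: an in-memory SQLite table used as a key-value store
# for enhanced records, keyed by a (hypothetical) record ID.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE records (id TEXT PRIMARY KEY, data TEXT)")

def put(record_id, record):
    db.execute("INSERT OR REPLACE INTO records VALUES (?, ?)",
               (record_id, json.dumps(record)))

def get(record_id):
    row = db.execute("SELECT data FROM records WHERE id = ?",
                     (record_id,)).fetchone()
    return json.loads(row[0]) if row else None

put("BV123", {"245a": "Example title"})
print(get("BV123"))  # {'245a': 'Example title'}
```

Swapping this for an ORM or raw SQL later only changes `put`/`get`, not the processing code.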