Source

papers / code4lib13 / isis2couchdb-unwrap.txt

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
by Luciano G. Ramalho

<h2><a name="section1"></a>1. Introduction</h2>

The relational database model is well grounded in theory and well supported by the software industry. Relational database management systems (RDBMS) replaced legacy hierarchical and network database products in nearly every enterprise and became the mainstream. Object-oriented DBMS have not taken the leading role from relational systems, even after object-oriented programming became the norm, and in spite of the so-called “impedance mismatch” perceived when one needs to store nested objects into a collection of flat tables.

After resisting the object-oriented wave, relational databases became so dominant that they seemed like the only game in town, or at least the only game in most shops. But in the last five years, Google published a paper on BigTable (<a href="#chang2006">Chang, 2006</a>), the proprietary, distributed non-relational database behind many of its online properties. Then Amazon.com revealed Dynamo, the non-relational database that powers some of their AWS cloud-based services (<a href="#decandia2007">Decandia, 2007</a>). Facebook open-sourced Cassandra, the distributed non-relational database created to enable its crucial in-box search feature. Cassandra is now also used by Twitter, Digg, Rackspace and Cisco, among many others, and has such a vibrant community that it became a top-level project of the Apache Foundation in 2010, less than three years after its initial public release (<a href="#apache2011a">Apache, 2011a</a>). These developments are part of a trend that became known, in 2009, as the “NoSQL" movement. NoSQL stands for “No SQL” or “Not Only SQL”, depending on who you ask; but everyone agrees it's not really about the SQL language, but about seeking alternative answers to problems that are not amenable to relational solutions.

Meanwhile, thousands of small to medium libraries in developing countries have been oblivious to all this and continue using their ISIS databases for daily operations. ISIS is a family of non-relational database systems with a history that goes back to the 1970's. It was developed by UNESCO – the United Nations Educational, Scientific and Cultural Organization – specifically for bibliographic data, and it is still officially distributed and widely used (<a href="#hopkinson2005">Hopkinson, 2005</a>).

ISIS had a very positive impact, allowing libraries with limited resources to computerize their catalogs on simple, stand-alone PCs. Some ISIS products became Web-enabled, but after that, progress has been very slow in the last 10 years. Meanwhile, a couple of new Open Source NoSQL databases present a possible migration path for ISIS users, a path that will not require wholesale normalization of decades of bibliographic data. This paper describes how BIREME/PAHO/WHO, a digital library that is part of the Pan American Health Organization, and a large-scale user of ISIS databases, has studied the viability of migrating one of its main databases to Apache CouchDB, a modern document database.

The rest of the paper is structured as follows. <a href="#section2">Section 2</a> discusses data models for bibliographic records, explains why it is tempting for librarians to look beyond the relational model and compares the structure of MARC records to the semistructured data model formalized in the 1990's. <a href="#section3">Section 3</a> briefly presents ISIS, then compares two recent products that support semistructured records, CouchDB and MongoDB, and shows our criteria for choosing among them. <a href="#section4">Section 4</a> discusses different formats for representing ISIS records in CouchDB, then describes the tool we developed for converting records and the process used to load them into CouchDB. <a href="#section5">Section 5</a> talks about how indexing works in ISIS and CouchDB, and exemplifies queries on the latter. Finally, <a href="#section6">Section 6</a> summarizes what we have done, our conclusions and ongoing work.

<h2><a name="section2"></a>2. Data models for bibliographic records</h2>

One of the defining characteristics of the relational data model, the First Normal Form (1NF), dictates that attributes must be atomic. In other words, in the industry-standard flat relational model<sup><a name="noteorig1" href="#note1">1</a></sup>, fields cannot be structured into subfields nor contain multiple values.

On the other hand, multivalued fields are useful and common for representing bibliographic data. Books often have more than one author and cover multiple subjects. A publisher, while logically a single attribute of a book record, has attributes of its own, like a name and a place. It is therefore useful to split a publisher field into parts. That is why MARC supports subfields and multivalued fields through repeating tags (similar to XML). The MARC structure, or “empty container”<sup><a name="noteorig2" href="#note2">2</a></sup> carried over to the ISO-2709 standard, which describes a generic data interchange format allowing applications to attach any meaning to the tags and subfield markers. Here we will refer to the “ISO-2709 data model” as a generalization of the data model implied by the MARC record structure.

Of course, library data is often stored in relational databases, but normalization means that the description of a single item is spread over multiple records in several tables. That may work well within the context of one library, but when exchanging data, it is more practical to have just one record per item. That is one reason why MARC is still with us, and why XML became so important in a world dominated by relational databases.

If the ISO-2709 data model does not adhere to the flat relational model, then what is the theory behind it? At first there was none, but in the mid-1990s, while XML was still under development, researchers from UPenn, Stanford, AT&T Labs and the INRIA French R&D agency conceived the semistructured<sup><a name="noteorig3" href="#note3">3</a></sup> data model (<a href="#abiteboul1999">Abiteboul, 1999</a>). Their work supports reasoning about MARC and ISO-2709 records but also about the richer data models of XML and JSON<sup><a name="noteorig4" href="#note4">4</a></sup> documents.

A useful definition by Dan Suciu, a pioneer of semistructured data research, appears in the “Encyclopedia of Database Systems” (<a href="#suciu2009">Suciu, 2009</a>):

<blockquote> The semi-structured data model is designed as an evolution of the relational data model that allows the representation of data with a flexible structure. Some items may have missing attributes, others may have extra attributes, some items may have two or more occurrences of the same attribute. The type of an attribute is also flexible: it may be an atomic value or it may be another record or collection. Moreover, collections may be heterogeneous, i.e., they may contain items with different structures. The semi-structured data model is a self-describing data model, in which the data values and the schema components co-exist. </blockquote>

Thanks to the semistructured data model, we don't need to feel ashamed of our denormalized bibliographic records any more. Even better, we have a framework to evaluate databases of the current NoSQL crop with regard to our needs as caretakers of denormalized datasets.

The book “Data on the Web: From Relations to Semistructured Data and XML” (<a href="#abiteboul1999">Abiteboul, 1999</a>), introduces a notation for representing semistructured data. It is called “ssd-expression” and it looks like this:

<pre>
  {name: {first: "Lewis", last: "Carroll"},
   tel: 5553457,
   email: "cld@ox.ac.uk",
   email: "lcarroll@pobox.com"
  }
</pre>

Except for the repetition of the <code>email</code> key, it's very similar to JSON (<a href="#crockford2006a">Crockford 2006a</a>). The presence of keys describing fields like <code>name</code>, <code>first</code> etc., and the storage of those keys alongside the data, is what the encyclopedia entry means by coexisting data values and schema components (<a href="#suciu2009">Suciu, 2009</a>). It is a characteristic shared by ISO-2709, XML, and JSON (with the limitation that ISO-2709 fields are identified by tags composed of three alphanumeric ASCII characters, though most applications only use numeric tags). The repeated <code>email</code> tag could be represented similarly in XML and in ISO-2709, but in JSON the keys must be unique within an object structure, so multivalued fields are often represented as lists of values:

<pre>[sourcecode language='javascript']
  {"name": {"first": "Lewis", "last": "Carroll"},
   "tel": 5553457,
   "email": ["cld@ox.ac.uk", "lcarroll@pobox.com"]
  }
[/sourcecode]</pre>

The ISO-2709 record format influenced the record structure of the ISIS family of databases, introduced by UNESCO in the 1970s. ISIS records follow the structure of the ISO-2709 standard, and most ISIS systems import and export to that format. Here is the same record of the examples above, as seen in a common ISIS display format:

<pre>
  10 «^fLewis ^lCarroll»
  20 «5553457»
  30 «cld@ox.ac.uk»
  30 «lcarroll@pobox.com»
</pre>

In an ISIS record the field names are replaced by numeric tags. The «» delimiters are not part of the content, but a feature of the display format. Here there are two occurrences of tag #30, the e-mail field. Field #10 is split into two subfields ("f" and "l") by special markers. Instead of the subfield delimiter control character used in MARC systems, ISIS uses the circumflex accent: ^ (ASCII 94). A subfield identifier is case-insensitive and consists of just one ASCII letter or digit. The ISIS data model is simpler than that of ISO-2709 because it does not have field indicators<sup><a name="noteorig5" href="#note5">5</a></sup> and subfields are not repeatable<sup><a name="noteorig6" href="#note6">6</a></sup>.

Clearly the ISIS data model is not as flexible as the general semistructured data model. Given the syntax of subfield markers, ISIS fields are limited to one level of nesting only. In other words, borrowing from the XML jargon: subfields cannot have child elements, just character data. Also, an ISIS field may have mixed content<sup><a name="noteorig7" href="#note7">7</a></sup>: some unmarked text may appear before the first subfield marker. For example, in this field:

<pre>
  18 «Lords of Finance^bThe Bankers Who Broke the World»
</pre>

The main title, "Lords of Finance" is not preceded by a subfield marker. The semistructured data model, as described by Abiteboul et. al., has no concept of mixed content like XML has. So we will call that part “the main subfield” and pretend it is preceded by an implied "^_" (underscore) subfield marker whenever we need to refer to it in code. With this simple arrangement, any ISIS record can be represented as an ssd-expression or a JSON record.

<h2><a name="section3"></a>3. Semistructured database systems</h2>

<h3>3.1. The ISIS family</h3>

In 1985, Giampaolo Del Bigio ported UNESCO's CDS/ISIS mainframe database system to PC/DOS (<a href="#lopes2010">Lopes, 2010</a>). The result, called MicroISIS, was distributed free of charge and became the <i>de facto</i> standard for computerized library catalogs in developing countries. Its Windows version, WinISIS is still distributed by UNESCO and remains widely used in small to medium libraries (<a href="#hopkinson2005">Hopkinson, 2005</a>). Beyond local OPACs, ISIS is also used by two large regional bibliographic databases in Latin America and the Caribbean: SciELO (Scientific Electronic Library Online) and LILACS (Latin American and Caribbean Health Sciences index), both built by BIREME/PAHO/WHO, which is a specialized center of the Pan American Health Organization located in São Paulo, Brazil. Together, SciELO and LILACS routinely handle millions of bibliographic records using CISIS, a custom version of ISIS developed by BIREME/PAHO/WHO. MicroISIS, WinISIS, CISIS and the more recent J-ISIS from UNESCO are all interoperable and form the ISIS family of database systems.

The legacy ISIS codebases are showing their age, and after years of closed-source development, none of the derivations have managed to become successful Open Source projects. Meanwhile, the NoSQL movement has highlighted some Open Source non-relational databases implementing variations of the semistructured data model. This has motivated BIREME/PAHO/WHO to look for alternatives.

<h3>3.2 CouchDB and MongoDB</h3>

Among the recent crop of NoSQL systems, two products stand out for implementing the semistructured data model in a way that matches the operational needs of BIREME/PAHO/WHO: Apache CouchDB and MongoDB (<a href="#apache2011b">Apache, 2011b</a>, <a href="#mongodb2011">MongoDB.org, 2011</a>). Their data model is not as flat as the key-value stores (optimized for retrieving blobs given a primary key), nor as deep as graph databases (designed to allow general queries over paths of nested objects). What they offer is somewhere in between: JSON-like records allowing nested structures, and expressive query languages to index and retrieve those records. Those rich records are called “documents”. CouchDB and MongoDB call themselves “document databases”.

In the case of CouchDB, the document format is JSON. MongoDB uses BSON, a binary format inspired by JSON but offering more data types, such as int32, datetime, and even a type called “JavaScript code w/ scope” (<a href="#dirolf2010">Dirolf, 2010</a>). JSON can be trivially converted to BSON, but the reverse may not be so easy, depending on the data. While the difference between those formats can be considered an implementation detail, they reflect different priorities.

MongoDB is optimized for fast updating, by aggressively caching writes and by overwriting updated records in place whenever possible: the BSON structure is designed to allow updating specific fields of an existing document. The flip-side of this optimization is that a software crash can cause data loss, and a minimum deployment of two servers in master-slave configuration is recommended if durability is required<sup><a name="noteorig8" href="#note8">8</a></sup>. In contrast, CouchDB is fault-tolerant: it only appends to the database file on disk, and that is an atomic operation in modern operating systems. As a result, the database file is always consistent in the event of a software crash, and backups can be made while the system is running. The drawbacks of the append-only design are the need for periodic database compaction – a time-consuming batch operation – and slower updates.

CouchDB was influenced by Lotus Notes, a networked, collaborative application suite designed to support users who are often off-line. So, CouchDB allows master-master replication, that is, synchronization between peer nodes which have received inserts and updates independently. MongoDB supports only master-slave replication: only one node receives updates and inserts, and replicates to the slaves.

While CouchDB can be used in large clusters, it is also well-suited as a database for small desktop apps. For that reason it comes pre-installed on Ubuntu Linux since 2009. It even runs on Android mobile devices (a free, one-click installer can be found in the Android Marketplace). Because CouchDB supports HTTP and JSON natively, and can run JavaScript procedures on the server, applications can be developed without any middleware: browsers communicate directly with the database, and all the logic is written in JavaScript.

The informative post “Comparing Mongo DB and Couch DB” was written by Dwight Merriman, CEO of 10gen, the company behind MongoDB (<a href="#merriman2010">Merriman, 2010</a>). Given the characteristics of these systems, we decided to use CouchDB for our LILACS experiments for these reasons:

<ul> <li>Easier deployment in a durable configuration.</li> <li>Support for master-master replication, useful for distributed cooperative   cataloging applications.</li> <li>Direct support for JSON over HTTP, enabling Web Services and AJAX   applications without middleware.</li> </ul>

However, we do envision scenarios in which we would like to use both MongoDB and CouchDB. For example, the canonical store of bibliographic data could be CouchDB, while a cluster of MongoDB instances could be used for user-provided content, tracking, recommendations and other features demanding faster database writes.

<h2><a name="section4"></a>4. Loading ISIS records into CouchDB</h2>

<h3>4.1. Alternative representations of ISIS records as JSON documents</h3>

The ISIS data model, a subset of the ISO-2709 model, is not as expressive as the JSON data model. There are several ways to represent the same ISIS record in JSON. Consider this abridged example of a LILACS record:

<pre>
   1 «BR1.1»
   2 «538886»
   4 «LILACS»
   4 «LLXPEDT»
   5 «S»
   6 «as»
   8 «Internet^ihttp://…/imageBank/PDF/v3n3a04.pdf?aid2=168&amp;…»
  10 «Kanda, Paulo Afonso de Medeiros^1University of São Paulo
      ^2School of Medicine^3Cognitive Disorders of Clinicas
      Hospital Reference Center^pBrasil ^cSão Paulo^rorg»
  10 «Anghinah, Renato^1University of São Paulo^2School of
      Medicine ^3Cognitive Disorders of Clinicas Hospital
      Reference Center ^pBrasil^cSão Paulo^rorg»
  12 «The Clinical use of quantitative EEG in cognitive disorders
      ^ien»
  12 «A utilização clínica do EEG quantitativo nos transtornos
      cognitivos^ipt»
  30 «Dement. neuropsychol»
  31 «3»
  32 «3»
  35 «1980-5764»
</pre>

Tag #2 is the LILACS identifier, a non-repeating, numeric field, and tag #10 is an author field in an analytic (article-level), record. The LILACS data dictionary (<a href="#bireme2008">BIREME, 2008</a>) defines #10 as a repeating field. In addition, it is composed of the following subfields:

<table> <tr> 	<th>subfield</th> 	<th>description</th> 	<th>sample content</th> </tr> <tr> 	<th>main</th> 	<td>author name</td> 	<td>Kanda, Paulo Afonso de Medeiros</td> </tr> <tr> 	<th>^1</th> 	<td>affiliation level 1</td> 	<td>University of São Paulo</td> </tr> <tr> 	<th>^2</th> 	<td>affiliation level 2</td> 	<td>School of Medicine</td> </tr> <tr> 	<th>^3</th> 	<td>affiliation level 3</td> 	<td>Cognitive Disorders of Clinicas Hospital Reference Center</td> </tr> <tr> 	<th>^p</th> 	<td>country</td> 	<td>Brasil</td> </tr> <tr> 	<th>^c</th> 	<td>city</td> 	<td>São Paulo</td> </tr> <tr> 	<th>^r</th> 	<td>degree of responsibility</td> 	<td>org (organizer)</td> </tr> </table>

One way to represent such a record in JSON is shown below (only fields #10 and #2 are shown; some contents shortened for clarity):

<pre>[sourcecode language='javascript']
{
    "10": [
        {
            "_": "Kanda, Paulo Afonso",
            "1": "University of São Paulo",
            "2": "School of Medicine",
            "3": "Cognitive Disorders of … Reference Center",
            "c": "São Paulo",
            "p": "Brasil",
            "r": "org"
        },
        {
            "_": "Smidth, Magali Taino",
            "1": "University of São Paulo",
            "2": "School of Medicine",
            "3": "Cognitive Disorders of … Reference Center",
            "c": "São Paulo",
            "p": "Brasil",
            "r": "org"
        }
    ],
    "2": [
        {
            "_": "538886"
        }
    ]
}
[/sourcecode]</pre>

In the representation above, the record is a JSON mapping between tags (eg. “10”) and lists of values (or occurrences in ISIS jargon). The use of lists allows any field to be repeatable, therefore any possible ISIS record fits this scheme. Also, within the list of occurrences, each one is represented by a mapping of subfield keys and values. The special key “_” (underscore) is used to denote the main subfield, and when there are no subfields that is the only key. This creates unnecessary nesting, as demonstrated by field #2 which is the LILACS identifier field, non-repeating and devoid of subfields as defined by the LILACS data dictionary. But it does make the structure homogeneous, an important feature when dealing with legacy semistructured datasets where not all records adhere to the current schema. During our research we called the representation above ISIS-JSON type 3 (<a href="#bireme2010">BIREME, 2010</a>).

The main drawback of ISIS-JSON type 3 is that, by definition, JSON mappings are unordered<sup><a name="noteorig9" href="#note9">9</a></sup>. Therefore, neither the tag order within the record nor the subfield order within a field occurrence are preserved. In the LILACS database, tag order within a record is only relevant when there are repeating tags, such as #10 (author, analytical level). Fortunately, keeping repeated occurrences in a list does preserve their relative ordering, even if the overall field order is not kept. For example, in the original ISIS record under analysis field #2 appeared before all occurrences of field #10, while in the JSON representation the key “10” precedes key “2”. This is irrelevant in practice. But, crucially, the order of the authors is the same in both formats: first “Kanda, Paulo Afonso”, then “Smidth, Magali Taino”.

Regarding subfield ordering within a field, the LILACS data dictionary does establish a canonical ordering. For example, in the case of field #10 the order is _, 1, 2, 3, p, c, r (the main subfield is always first, by definition). However, as Jason Thomale points out in “Interpreting MARC: Where’s the Bibliographic Data?” (<a href="#thomale2010">Thomale, 2010</a>), subfield markers should be interpreted as textual markup, and their ordering is significant. Particularly when cleaning up or converting legacy data, maintaining the order of subfields from the original record may be important to ascertain the original cataloger's intention, even if LILACS subfields do have a prescribed ordering.

To preserve subfield ordering, two alternative JSON representations have been tried at BIREME/PAHO/WHO. The first is ISIS-JSON type 1 (line breaks added within long strings for clarity):

<pre>[sourcecode language='javascript']
{
    "10": [
        "Kanda, Paulo Afonso de Medeiros^1University of São Paulo
          ^2School of Medicine^3Cognitive Disorders of Clinicas
          Hospital Reference Center^pBrasil ^cSão Paulo^rorg",
        "Smidth, Magali Taino^1University of São Paulo ^2School
          of Medicine^3Cognitive Disorders of Clinicas Hospital
          Reference Center^pBrasil ^cSão Paulo^rorg"
    ],
    "2": ["538886"]
}
[/sourcecode]</pre>

In this format, each field occurrence is a single string, with the subfield delimiters embedded. This essentially “punts” on the issue of how to represent subfields, leaving the parsing to be done by the database. Because CouchDB uses JavaScript as its default language for creating indexes, it is a viable approach. Another possibility is ISIS-JSON type 2:

<pre>[sourcecode language='javascript']
{
    "10": [
        [
            ["_", "Kanda, Paulo Afonso"],
            ["1", "University of São Paulo"],
            ["2", "School of Medicine"],
            ["3", "Cognitive Disorders of … Reference Center"],
            ["p", "Brasil"],
            ["c", "São Paulo"],
            ["r", "org"]
        ],
        [
            ["_", "Smidth, Magali Taino"],
            ["1", "University of São Paulo"],
            ["2", "School of Medicine"],
            ["3", "Cognitive Disorders of … Reference Center"],
            ["p", "Brasil"],
            ["c", "São Paulo"],
            ["r", "org"]
        ]
    ],
    "2": [
        [
            ["_", "538886"]
        ]
    ]
}
[/sourcecode]</pre>

Here each field occurrence is an association list<sup><a name="noteorig10" href="#note10">10</a></sup>, that is, a mapping represented as a list of lists, where each inner list contains a key-value pair. The advantage of an associative list over a JSON object mapping is that the order of items is preserved, but at the cost of a linear search to locate a key within a field occurrence. Because of how indexing works in CouchDB, this is not as costly as it may seem, as will be seen in the next section. For MARC-style records, the association list has the added advantage of allowing repeated mappings to represent repeating subfields.

<h3>4.2. The isis2json conversion tool</h3>

To convert ISIS records into JSON structures, we developed a Python script called <code>isis2json.py</code> (<a href="#bireme2010b">BIREME, 2010b</a>). It can be executed with both the Python and Jython interpreters, versions 2.5 through 2.7. When running under Python it can read only ISO-2709 files, but as a Jython script it leverages the Bruma Java library (<a href="#barbieri2011">Barbieri, 2011</a>) and can also read binary ISIS files in .MST format directly. Several options control the structure of the JSON output. For example, the command line below generates output suitable for batch importing to CouchDB:

<pre>
$ ./isis2json.py cds.iso -c -t 3 -q 100 > cds1.json
</pre>

The arguments used in the above example are:

<code>-c</code> to generate records within a “docs” list, as required by the CouchDB <code>_bulk_docs</code> API;

<code>-t 3</code> for ISIS-JSON type 3 export (fields as dictionaries of subfields);

<code>-q 100</code> to output only 100 records;

Here is the help screen of isis2json.py:

<pre>
$ ./isis2json.py -h
usage: isis2json.py [-h] [-o OUTPUT.json] [-c] [-m] [-t ISIS_JSON_TYPE]
                    [-q QTY] [-s SKIP] [-i TAG_NUMBER] [-u] [-p PREFIX] [-n]
                    [-k TAG:VALUE]
                    INPUT.(mst|iso)

Convert an ISIS .mst or .iso file to a JSON array

positional arguments:
  INPUT.(mst|iso)       .mst or .iso file to read

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT.json, --out OUTPUT.json
                        the file where the JSON output should be written
                        (default: write to stdout)
  -c, --couch           output array within a "docs" item in a JSON document
                        for bulk insert to CouchDB via POST to db/_bulk_docs
  -m, --mongo           output individual records as separate JSON
                        dictionaries, one per line for bulk insert to MongoDB
                        via mongoimport utility
  -t ISIS_JSON_TYPE, --type ISIS_JSON_TYPE
                        ISIS-JSON type, sets field structure: 1=string,
                        2=alist, 3=dict
  -q QTY, --qty QTY     maximum quantity of records to read (default=ALL)
  -s SKIP, --skip SKIP  records to skip from start of .mst (default=0)
  -i TAG_NUMBER, --id TAG_NUMBER
                        generate an "_id" from the given unique TAG field
                        number for each record
  -u, --uuid            generate an "_id" with a random UUID for each record
  -p PREFIX, --prefix PREFIX
                        concatenate prefix to every numeric field tag (ex. 99
                        becomes "v99")
  -n, --mfn             generate an "_id" from the MFN of each record
                        (available only for .mst input)
  -k TAG:VALUE, --constant TAG:VALUE
                        Include a constant tag:value in every record (ex. -k
                        type:AS)
</pre>



<h3>4.3. Interacting with CouchDB</h3>

HTTP is the main protocol for interacting with CouchDB<sup><a name="noteorig11" href="#note11">11</a></sup>. Most database administration is done via Futon, its built-in web interface.

<div style="width: 500px; margin: auto;"><a href="/media/issue13/ramalho/fig1-futon.png"><img class="caption" src="/media/issue13/ramalho/fig1-futon-500x422.png" alt="Futon screen shot: viewing a record" /></a></div> <p class="caption"><strong>Figure 1.</strong> CouchDB Futon web interface showing a LILACS record in ISIS-JSON Type-2 format.</p>

Nearly every task which can be done via Futon can also be automated using the RESTful API of CouchDB via any HTTP client. In fact, interacting with CouchDB is a great way to learn, in practice, what REST and RESTful interfaces are all about. Here we use cURL, a command-line HTTP client available for many systems. For example, a database is created with a simple PUT request:

<pre>
$ curl -X PUT http://admin_user:password@127.0.0.1:5984/lilacs
</pre>

One operation that cannot be done via Futon is uploading records in bulk. Continuing the previous example, we could convert an ISO-2709 file to JSON, then upload it to CouchDB using cURL in two steps:

<pre>
$ ./isis2json.py cds.iso -c > cds1.json
$ curl -d cds1.json -H"Content-Type: application/json" \
       -X POST http://127.0.0.1:5984/lilacs/_bulk_docs
</pre>

When used together, the <code>-q</code> and the <code>-s</code> (skip) options allow splitting the output into several batches, which results in faster loading for large datasets on multi-core servers. For example, loading 20,000 full LILACS records, totaling 64MB, took 147 seconds using this shell command:

<pre>
$ isis2json.py -c -p v -i 2 -q 20000 lilacs100k.iso | \
  curl -d @- -H"Content-Type: application/json" \
       -X POST http://127.0.0.1:5984/lilacs/_bulk_docs
</pre>

Loading the same data in two 10,000-record batches took 100 seconds when taking advantage of two cores with a simple shell script like this one:

<pre>
#!/bin/bash
isis2json.py -c -p v -i 2 -q 10000 lilacs100k.iso | \
curl -d @- -H"Content-Type: application/json" \
     -X POST http://127.0.0.1:5984/lilacs/_bulk_docs &
isis2json.py -c -p v -i 2 -q 10000 -s 10000 lilacs100k.iso | \
curl -d @- -H"Content-Type: application/json" \
     -X POST http://127.0.0.1:5984/lilacs/_bulk_docs
</pre>

If an <code>_id</code> attribute is not present in an inserted document, CouchDB provides one with a UUID (Universally Unique Identifier) like <code>"ead3af23a4459b2d7a1aef05cb0012a9"</code>. It is highly recommended that an <code>_id</code> is given when adding documents to prevent inadvertent duplication of records if a bulk loading process is interrupted and restarted. Therefore <code>isis2json.py</code> provides the <code>-i</code> option, used in the examples above, to fetch the value of one field in the ISIS input and use it as the <code>_id</code> attribute.

So, for example, using the <code>-i 2</code> option this (partial) ISIS structure:

<pre>
   1 «BR1.1»
   2 «538886»
   4 «LILACS»
   4 «LLXPEDT»
</pre>

Can be converted into this JSON document record, ready to import into CouchDB:

<pre>[sourcecode language='javascript']
{
    "_id": "538886",
    "1": [
        [
            ["_", "BR1.1"]
        ]
    ],
    "2": [
        [
            ["_", "538886"]
        ]
    ],
    "4": [
        [
            ["_", "LILACS"]
        ],
        [
            ["_", "LLXPEDT"]
        ]
    ]
}
[/sourcecode]</pre>

Note how the content of the first occurrence of the #2 tag was used as <code>_id</code>.

<h2><a name="section5"></a>5. Indexing and querying</h2>

<h3>5.1. Building indexes</h3>

For efficient retrieval, records must be indexed. It is useful to distinguish between a primary index and secondary indexes. A primary index allows retrieval via the primary key. Every NoSQL database uses a primary index to allow queries on the primary key. In fact, “key-value” databases often provide only this means of retrieval. Document databases also support secondary indexes, allowing fast access via arbitrary attributes of the records. In order to index those attributes, some form of data extraction must be performed on the records. By default, JavaScript is the language used for data extraction in CouchDB<sup><a name="noteorig12" href="#note12">12</a></sup>. In ISIS databases, a simple but terse “ISIS Formatting Language” is used to generate secondary indexes, also called inverted files.

Index definition in ISIS is done in a special file called the “Field Select Table” (FST). Here are just a few lines from the large FST used to generate indexes for the LILACS database:

<pre>
98 0 v1,"-"v2
10 8 ,mpl,'|AU_|'(v10^*|%|/),
38 0 if s(mpu,v38^a,mpl):'CD' then 'SP_CD-ROM' fi,
</pre>

Without going into all the details, here is what those lines do (the numbering is not relevant to this discussion<sup><a name="noteorig13" href="#note13">13</a></sup>):

<ul> <li>line 98 indexes the contents of tags #1 (cooperating center code) and #2 (record id) concatenated with an hyphen;</li> <li>line 10 indexes each word of the main subfield of field #10 (author), concatenated with an <code>'AU_'</code> prefix;</li> <li>line 38 indexes the string 'SP_CD-ROM' if field 38, subfield ^a contains the string 'CD';</li> </ul>

To generate an inverted file, an ISIS system applies each line of the FST to each record in the database, generating one or more index entries (or postings) per record. Documents can also be indexed incrementally, but BIREME/PAHO/WHO does a lot of batch processing and usually does “full inversion”, that is, regenerating inverted files from scratch.

The overall process is very similar in CouchDB. Instead of an FST, CouchDB allows us to define “views”, which are the result of running JavaScript functions to generate indexes. The simplest view contains only one function, the <code>map</code> function, which receives a document as an argument, extracts some data from it, and calls a special <code>emit</code> function to add entries to the index.

The <code>emit</code> function takes two arguments: <code>key</code> and <code>value</code>. The <code>key</code> argument is the one actually indexed; queries will be made on it. It is usually unstructured, like a string, but sometimes it is useful to have an array as a key, for instance <code>[country, province]</code>, to allow multi-level grouping of results. The <code>value</code> may be any JSON structure, and it is used to display the results of a query. For example, in an OPAC application we may have an index where each entry has an ISBN as a key, and the book title and year of publication as the value.

If you know some SQL, you may find it easier to think in terms of a simple SELECT statement:

<pre>[sourcecode language='sql']
SELECT title, year FROM books WHERE isbn='9781594201820'
[/sourcecode]</pre>

Here we have a <code>books</code> table with <code>title</code>, <code>year</code> and <code>isbn</code> fields. To support a query like the one above we would need an index on <code>isbn</code>. The <code>title</code> and <code>year</code> fields have no influence on the search, but they are mentioned in the query because we want to display them in the search result.

In CouchDB terms, to support a similar query we would need a view with a <code>map</code> function emitting the ISBN as key, and the value would be a structure like <code>{"title":"Lords of Finance", "year":2009}</code>. The <code>map</code> function for that view would look like this:

<pre>[sourcecode language='javascript']
function(doc) {
  if (doc.type == "book") {
    emit(doc.isbn, {title: doc.title, year: doc.year});
  }
}
[/sourcecode]</pre>

Because CouchDB has no concept of tables like SQL databases have, it is common to have different types of records mixed in the same database. Therefore, <code>map</code> functions often have an <code>if</code> statement to select which records to index. In the example above, we index only documents of type <code>"book"</code>. For each book, the key will be the ISBN and the value will be an object with the <code>title</code> and <code>year</code> attributes.

In addition to a <code>map</code> function, CouchDB views may have a <code>reduce</code> function, which is used to aggregate results. Views with <code>reduce</code> functions will be discussed later.

<h3>Indexing ISIS-JSON type 2 records: first approach</h3>

Being able to use a powerful language like JavaScript to create <code>map</code> functions gives a lot of flexibility when creating views. We can dig as deep as necessary into the structure of our documents, and massage the data we find in sophisticated ways. For one report, we created an index of all the LILACS records containing fields with repeating subfield markers. That would not be viable in an ISIS FST. However, one important limitation of <code>map</code> functions is that we have no access to data that is not part of the document being indexed. It is impossible, for example, to look up some value in another record, like a join operation in an RDBMS (ISIS also allows this, via the <code>l/lookup</code> function of the ISIS Formatting Language).

To start with a simple example, let us create a view to index LILACS records by tag #1 (cooperating center code), the id of the cataloging institution which created the record. From the LILACS data dictionary we know that tag #1 is non-repeating and has no subfields, which means all the content of that field resides in the main subfield (the first one) and only the first occurrence matters (because there are no repetitions). Here is a view function that would create an index with field #1 as key and field #12 (title, analytic) as value:

<pre>[sourcecode language='javascript']
function(doc) {
  emit(doc["1"][0][0][1], doc["12"][0][0][1]);
}
[/sourcecode]</pre>

Here we begin to see the price we pay for the generality of the ISIS-JSON type 2 structure. The LILACS field #1 is non-repeating and has no subfields, so it could be represented like this:

<pre>[sourcecode language='javascript']
{
    "1" : "UY6.1",
    ...
}
[/sourcecode]</pre>

And then we could reach it within a <code>map</code> function with the simple expression <code>doc["1"]</code>. Even better, we could add an alpha prefix to the field name (the <code>--prefix</code> option of <code>isis2json.py</code> does that). Then we could use dot notation (<code>doc.v1</code>) to access tag v1 in the following:

<pre>[sourcecode language='javascript']
{
    "v1" : "UY6.1",
    ...
}
[/sourcecode]</pre>

However, to deal with ISIS records of any kind in the absence of schema information, we must assume that every field may have more than one occurrence, therefore its value cannot be just a string, but an array of occurrences. In addition, we must assume that every field may have subfields, therefore each occurrence must be structured as an associative list (ISIS-JSON type 2) or a dictionary (ISIS-JSON type 3), unless we want to parse the subfields every time when indexing. In an ISIS-JSON type 2 record, field #1 is represented as:

<pre>[sourcecode language='javascript']
{
    "1": [
        [
            [
                "_",
                "UY6.1"
            ]
        ]
    ],
    ...
}
[/sourcecode]</pre>

And to access its value we must write <code>doc["1"][0][0][1]</code>. Here is why:

<pre>
doc   ->   {"1": [[["_", "UY6.1"]]]}     # entire document
doc["1"]   ->   [[["_", "UY6.1"]]]     # list of occurrences of field #1
doc["1"][0]  ->  [["_", "UY6.1"]]    # list of subfields of occurrence 0 of #1
doc["1"][0][0] -> ["_", "UY6.1"]   # subfield 0 of occurrence 0 of #1 (key "_")
doc["1"][0][0][1]  ->  "UY6.1"   # value of subfield 0 of occurrence 0 of #1
</pre>



This quickly becomes tedious, and very burdensome in more complex cases, for instance, to retrieve a specific subfield aside from the main one we would need to iterate over the subfield keys. We will soon show a library to make it easier to handle ISIS-JSON type 2 records, but first let's return to our simple <code>map</code> function:

<pre>[sourcecode language='javascript']
function(doc) {
  emit(doc["1"][0][0][1], doc["12"][0][0][1]);
}
[/sourcecode]</pre>

This function is part of a view in a "design document" in CouchDB. A design document is written in JSON, stored in CouchDB as other documents are, but its identifier starts with <code>_design/</code>. It may contain JavaScript functions for creating views as well as formatting output, validating document inserts/updates, etc. Each CouchDB database may have several design documents, and each design document may have several views.

When trying out CouchDB initially, the easiest way to create a view is by using the "Temporary view..." option of the view dropdown in the top right area of the Futon interface. Then you can run the <code>map/reduce</code> functions and quickly see their results (if your dataset is not too large). To save your work to a permanent view, you will be prompted to provide a design document name and a view name.

For any serious work, the CouchApp tool is highly recommended (<a href="#anderson2009">Anderson, 2009</a>). With it you can develop your design documents in your local filesystem, using your favorite editor and version control system, then push your code to a local or remote CouchDB instance with a command. All of the views for this article were developed in this manner. The code is in Bitbucket (<a href="#ramalho2011">Ramalho, 2011</a>).

<div style="width: 500px; margin: auto;"><a href="/media/issue13/ramalho/fig2-tempview.png"><img class="caption" src="/media/issue13/ramalho/fig2-tempview-500x399.png" alt="Futon screen shot: testing a temporary view" /></a></div> <p class="caption"><strong>Figure 2.</strong> Editing and running a temporary view in CouchDB Futon.</p>

When a view is first visited via HTTP, CouchDB indexes all of the documents by applying the <code>map</code> function to each of them. CouchDB also incrementally updates the indexes if documents are inserted or updated. The indexing is only done on demand, when a view is actually requested, and not when documents are created or changed.

To install the <code>map</code> function above, we created a design document called <code>lilacs</code> and within it a view called <code>center</code> (for "cooperating center"). Here is part of the result of requesting that view with the cURL utility:

<pre>
$ curl -s http://ramalho.couchone.com/lilcouch/_design/lilacs/_view/center | head -5
{"total_rows":926,"offset":0,"rows":[
{"id":"559682","key":"AR1.1","value":"Hemorragia digestiva baja: [revisi\u00f3n]"},
{"id":"559683","key":"AR1.1","value":"La radiolog\u00eda del colon por enema en el siglo XXI"},
{"id":"559684","key":"AR1.1","value":"El s\u00edndrome an\u00e9mico en las neoplasias del colon derecho"},
{"id":"559685","key":"AR1.1","value":"Met\u00e1stasis cerebrales del carcinoma epidermoide del ano: comunicaci\u00f3n de un caso"},
</pre>

Note that the result comes as a JSON object with three properties: <code>total_rows</code>, <code>offset</code> and <code>rows</code>, the latter being an array of <code>map</code> function results. Besides the key and value generated by the <code>emit</code> function, each result row also has an <code>id</code> attribute, which carries the <code>_id</code> property of the corresponding indexed document. By default, the result is sorted in ascending key order. Here we used the <code>head</code> shell command to crop the displayed results; we will soon see a way of limiting the results actually sent by CouchDB.

By the way, the URL shown here is public and you should be able to access it to try the examples by editing the URL:

<a href="http://ramalho.couchone.com/lilcouch/_design/lilacs/_view/center" >http://ramalho.couchone.com/lilcouch/_design/lilacs/_view/center</a>

However, please note that the <code>lilcouch</code> database there contains only a sample of 1000 records, not the full LILACS database.

<h3>5.2. Querying a view</h3>

As we have just seen, CouchDB queries are executed by making HTTP requests on views. If no arguments are passed, all rows of the result are returned. However, there are several arguments that can be used to filter the results. For example, the following query uses the <code>descending</code> and <code>limit</code> arguments. Note the use of single quotes around the URL, necessary because "&" is a shell operator and we are using the command line. In this case the <code>head</code> shell command was not used. The <code>total_rows</code> property still counts 926, but only the last three rows are returned because of the <code>limit=3</code> option.

<pre>
$ curl -s 'http://ramalho.couchone.com/lilcouch/_design/lilacs/_view/center?descending=true&limit=3'
{"total_rows":926,"offset":0,"rows":[
{"id":"560003","key":"UY6.1","value":"Vacunaci\u00f3n antigripal en personal de salud del Hospital Pedi\u00e1trico del Centro Hospitalario Pereira Rossell: evoluci\u00f3n de la cobertura del a\u00f1o 2006 al 2008"},
{"id":"560002","key":"UY6.1","value":"Contin\u00faa descendiendo la mortalidadpor asma en Uruguay: per\u00edodo 1984-2008"},
{"id":"560001","key":"UY6.1","value":"Localizaci\u00f3n de lesiones mamarias subcl\u00ednicas con marcador met\u00e1lico (arponaje): an\u00e1lisis de los m\u00e1rgenes quir\u00fargicos"}
]}
</pre>

Another possibility is to filter by key. The next query returns only the documents created by the cooperating center with code "CO113":

<pre>
$ curl -s 'http://ramalho.couchone.com/lilcouch/_design/lilacs/_view/center?key="CO113"'
{"total_rows":926,"offset":744,"rows":[
{"id":"559986","key":"CO113","value":"De serendipia al tratamiento quir\u00fargico de la diabetes mellitus tipo II"},
{"id":"559987","key":"CO113","value":"La Academia Nacional de Medicina se pronuncia sobre la emergencia social, la medicina y la salud"},
{"id":"559988","key":"CO113","value":"Presentaci\u00f3n inicial de las pacientes con diagn\u00f3stico de c\u00e1ncer de seno en el Centro Javeriano de Oncolog\u00eda, Hospital Universitario San Ignacio"},
{"id":"559989","key":"CO113","value":"\u00bfEs la diabetes mellitus tipo 2 una enfermedad de tratamiento quir\u00fargico?"},
{"id":"559990","key":"CO113","value":"P\u00f3lipos de la ves\u00edcula"},
{"id":"559991","key":"CO113","value":"Ruptura espl\u00e9nica espont\u00e1nea asociada a linfoma perif\u00e9rico de c\u00e9lulas T, presentaci\u00f3n de un caso y revisi\u00f3n de la literatura"},
{"id":"559992","key":"CO113","value":"Diverticulosis del yeyuno: complicaciones y manejo; reporte de caso y revisi\u00f3n de la literatura"},
{"id":"559993","key":"CO113","value":"Actinomicosis abdominal y p\u00e9lvica: reto diagn\u00f3stico y quir\u00fargico para el cirujano general"}
]}
</pre>

It is also possible to limit by starting and ending keys. For example, this query returns the documents created by centers starting with the "CO" prefix (for Colombia), up to and including the "CO149" center. Note that the <code>offset</code> property tells us that 744 rows were skipped to get to the first one with <code>key="CO"</code>:

<pre>
$ curl -s 'http://ramalho.couchone.com/lilcouch/_design/lilacs/_view/center?startkey="CO"&endkey="CO149"'
{"total_rows":926,"offset":744,"rows":[
{"id":"559986","key":"CO113","value":"De serendipia al tratamiento quir\u00fargico de la diabetes mellitus tipo II"},
{"id":"559987","key":"CO113","value":"La Academia Nacional de Medicina se pronuncia sobre la emergencia social, la medicina y la salud"},
[... several rows omitted ...]
{"id":"560469","key":"CO149","value":"Diagn\u00f3stico sanitario de diversos zoocriaderos helic\u00edcolas en Colombia: determinaci\u00f3n de los principales agentes pat\u00f3genos que afectan el caracol Helix aspersa (O.F. Muller, 1774) en cada etapa de ciclo biol\u00f3gico"},
{"id":"560470","key":"CO149","value":"La salud animal y la globalizaci\u00f3n: el desaf\u00edo de pol\u00edticas sostenibles y equitativas en el contexto de los pa\u00edses en desarrollo"}
]}
</pre>

<h3>5.3. Indexing ISIS-JSON type 2 records: using an API</h3>

Now we will develop a view called <code>au_countries</code> to list the countries of origin of the authors cited in the sample. This entails fetching the value of the "^p" subfield of field #10. A first stab at the <code>map</code> function came out like this:

<pre>[sourcecode language='javascript']
function(doc) {
  var occur, subf;
  if (doc.hasOwnProperty('10')) {
    for (occur=0; occur&lt;doc['10'].length; occur++) {
      for (subf=0; subf&lt;doc['10'][occur].length; subf++) {
        if (doc['10'][occur][subf][0] === 'p') {
          emit(doc['10'][occur][subf][1], 1);
          break;
        }
      }
    }
  } else {
    emit(null, 1);
  }
}
[/sourcecode]</pre>

Yes, that is quite ugly. Much of the ugliness comes from the fact that ISIS-JSON type 2 arranges subfields in a associative list, where a key can only be found through a linear search.

So here is the next solution, much simpler thanks to the use of a custom library:

<pre>[sourcecode language='javascript']
function(doc) {
  // !code vendor/isisdm_t2/_attachments/isisdm.js
  var i, occurs = ISIS.getallsub(doc, 10, "p");
  for (i=0; i&lt;occurs.length; i++) {
    if (occurs[i] !== undefined) {
      emit(occurs[i], 1);
    }
  }
}
[/sourcecode]</pre>

The first line of the function is a special <code>!code</code> comment that works as an include, and is processed by the CouchApp utility. This is actually a workaround because at this time CouchDB does not support the use of the CommonJS <code>require</code> function to import modules in <code>map</code> or <code>reduce</code> functions (<code>require</code> is supported elsewhere in CouchDB, in show and list functions for example). The <code>isisdm.js</code> module used here is part of the ISIS Data Model APIs we are developing at BIREME/PAHO/WHO to handle ISIS records in JavaScript and Python. Currently <code>isisdm.js</code> defines four main functions:

<dl> <dt><code>get(record, tag, missing)</code></dt>     <dd>returns first occurrence of <code>tag</code> in <code>record</code>; if     <code>tag</code> is not found, returns the value of the last argument or     <code>undefined</code> if that is not given;</dd> <dt><code>getall(record, tag)</code></dt>     <dd>returns an array with all occurrences of <code>tag</code> in     <code>record</code> or an empty array if <code>tag</code> is not found;</dd> <dt><code>getsub(record, tag, key, missing)</code></dt>     <dd>returns subfield identified by <code>key</code> in the first occurrence     of <code>tag</code> in <code>record</code>; if <code>tag</code> or     <code>key</code> are not found, returns the value of the last argument or     <code>undefined</code> if that is not given;</dd> <dt><code>getallsub(record, tag, key, missing)</code></dt>     <dd>returns an array with contents of subfield identified by     <code>key</code> in all occurrences of <code>tag</code> in     <code>record</code>;</dd> </dl>

We are using QUnit (<a href="#jquery2011">JQuery, 2011</a>) to perform unit tests on the <code>isisdm.js</code> module. Here is the the test page:

<div style="width: 500px; margin: auto;"><a href="/media/issue13/ramalho/fig3-qunit.png"><img class="caption" src="/media/issue13/ramalho/fig3-qunit-500x394.png" alt="Futon screen shot: viewing a record" /></a></div> <p class="caption"><strong>Figure 3.</strong> QUnit test results page, showing that all tests are passing. The tests for the <code>getallsub</code> function are expanded. This page is live at <a href="http://ramalho.couchone.com/lilcouch/_design/lilacs/tests.html" >http://ramalho.couchone.com/lilcouch/_design/lilacs/tests.html</a></p>

Now, back to the <code>map</code> function, note that the call to <code>emit</code> uses the subfield occurrence as key and the value is just a number 1: <code>emit(occurs[i], 1);</code>. This is because the intent of this view is to produce an aggregate count of each different key. To achieve this, we need a <code>reduce</code> function to sum the values emitted, like this:

<pre>[sourcecode language='javascript']
function(keys, values, rereduce) {
  return sum(values);
}
[/sourcecode]</pre>

Now that our <code>au_countries</code> view has both a <code>map</code> and <code>reduce</code> function, we can query it:

<pre>
$ curl -s 'http://ramalho.couchone.com/lilcouch/_design/lilacs/_view/au_countries'
{"rows":[
{"key":null,"value":2368}
]}
</pre>

The simplest, no-argument request returns null as the key and 2368 as the value. This is the result of the <code>reduce</code> function: 2368 is the sum of all the 1's emitted by the <code>map</code> function. In other words, there are 2368 occurrences of the #10 tag with a <code>^p</code> subfield.

The <code>reduce</code> function can be disabled if the <code>reduce=false</code> option is given in the request. Below are the first 5 and the last 5 lines of querying the <code>au_countries</code> view without applying the <code>reduce</code>. What you see is just the result of the <code>map</code> function:

<pre>
$ curl -s 'http://ramalho.couchone.com/lilcouch/_design/lilacs/_view/au_countries?reduce=false' | head -5
{"total_rows":2368,"offset":0,"rows":[
{"id":"560162","key":"\u00c1frica do Sul","value":1},
{"id":"560162","key":"\u00c1frica do Sul","value":1},
{"id":"560013","key":"Alemania","value":1},
{"id":"559596","key":"Algeria","value":1},
$ curl -s 'http://ramalho.couchone.com/lilcouch/_design/lilacs/_view/au_countries?reduce=false' | tail -5
{"id":"560034","key":"USA","value":1},
{"id":"560325","key":"USA","value":1},
{"id":"559604","key":"Venezuela","value":1},
{"id":"559957","key":"Venezuela","value":1}
]}
</pre>

A more interesting result can be obtained if we pass the <code>group=true</code> option:

<pre>$ curl -s 'http://ramalho.couchone.com/lilcouch/_design/lilacs/_view/au_countries?group=true&limit=10'
{"rows":[
{"key":"\u00c1frica do Sul","value":2},
{"key":"Alemania","value":1},
{"key":"Algeria","value":1},
{"key":"Argentina","value":289},
{"key":"Austr\u00e1lia","value":2},
{"key":"Bolivia","value":1},
{"key":"Brasi\u00e7","value":1},
{"key":"Brasil","value":1185},
{"key":"Brazil","value":26},
{"key":"Brsil","value":1}
]}
</pre>

The result set above, limited to 10 rows, shows the number of occurrences of each key. Obviously there are four entries for Brazil, with alternate and incorrect spellings. This would have to be dealt with elsewhere. The key point here is that the <code>reduce</code> function and the group option work together to produce aggregate results, similar to the ones we can produce with the SQL GROUP BY clause in a relational DBMS.

<h2><a name="section6"></a>6. Results and Conclusion</h2>

<h3>6.1. Results</h3>

We have identified in CouchDB and MongoDB two modern, Open Source database systems which are suitable for semistructured records like those defined by the ISO-2709 standard and the ISIS family of systems, serving the needs of MARC and LILACS datasets.

Furthermore, we created a tool to convert ISIS records from the ISO-2709 format to JSON documents suitable for loading into CouchDB (or MongoDB, though we have not shown that in this paper).

We considered a number of alternative representations for ISIS data in JSON format. In this paper we used the type 2 representation which, though somewhat awkward to work with, preserves subfield ordering and allows for repeated subfields, a feature of MARC records. ISIS fields do not have indicators, so we have not discussed how to represent them in JSON. One possibility would be to use special subfield markers, like "_1" and "_2", attached to field occurrences where the indicators are present.

Finally, we developed a JavaScript library to make it easy to handle ISIS-JSON type 2 records when creating CouchDB views.

<h3>6.2. Conclusion</h3>

These experiments and developments have shown that it is easy to import ISO-2709 data into a document database like CouchDB or MongoDB. After doing so, it becomes almost trivial to create Web services to publish the data, using just JavaScript, particularly in CouchDB thanks to its native support for JSON over HTTP.

While the semistructured data model was only formalized in the mid 1990's, the ISO-2709 and ISIS record formats have always been concrete, albeit limited, examples of it. Research into that model includes results such as algorithms to extract a formal schema from actual datasets, methods for dealing with shared or duplicate data, and a normal form adapted to semistructured schemas (<a href="tok2005">Tok, 2005</a>). We have much to learn and apply from semistructured data research into our daily work with bibliographic records.

At BIREME/PAHO/WHO we continue investigating the challenges and opportunities of converting from the ISIS legacy systems to modern Open Source document databases. Meanwhile, we are also developing new applications – not limited to the ISIS data model and legacy data – using CouchDB, JavaScript, Python and the Pyramid framework. Pyramid is a new web framework resulting from the merger of the Pylons and Repoze projects, and it includes Deform, a form generation and validation library with strong support for repeating fields and nested forms, important features when building interfaces for semistructured data entry (<a href="#pylons2011">Pylons, 2011</a>). These new developments allow us to think about how we want the LILACS bibliographic database to operate in the year 2015, when its 30th anniversary will be celebrated.

<h2>Notes</h2>

<a name="note1" href="#noteorig1"><sup>1</sup></a> The qualification “<em>flat</em> relational model” is used here because there is actually a controversy about the meaning of the First Normal Form. C. J. Date states that the atomicity requirement of the 1NF is pointless, and that since “types can be anything” it follows that “all relations are in first normal form by definition” (<a href="#date2005">Date, 2005</a>, p. 37). But E. F. Codd did define normalization as the process of removing non-simple domains from relations (<a href="#codd1970">Codd, 1970</a>), and current database textbooks define the 1NF as Codd did (<a href="#elmasri2006">Elmasri, 2006</a>; <a href="#silberschatz2006">Silberschatz, 2006</a>). Leaving aside the theory, although some RDBMS – like PostgreSQL and Oracle – support array values and even user-defined composite field values, modern database-independent access methods, such as the Ruby on Rails ActiveRecord ORM, often target the lowest common denominator, and that is the flat relational model.

<a name="note2" href="#noteorig2"><sup>2</sup></a> “The structure, or 'empty container', the content designators (tags, indicators, and subfield codes) used to explicitly identify or additionally characterize the data elements, and the content, the data itself (author's name, titles, etc.) are the three components of the [MARC II] format.” (<a href="#avram1975">Avram, 1975</a>)

<a name="note3" href="#noteorig3"><sup>3</sup></a> The term “semistructured” often appears as “semi-structured” in the literature. Searching for both spellings is recommended because Google returns more results for the hyphenated spelling, but some key proponents of the model use just one word (<a href="#abiteboul1999">Abiteboul, 1999</a>; <a href="#tok2005">Tok, 2005</a>; <a href="#suciu2001">Suciu, 2001</a>). Here I use one word, but kept the source spelling in quotations.

<a name="note4" href="#noteorig4"><sup>4</sup></a> JSON is JavaScript Object Notation, a data exchange format described as a light-weight alternative do XML (<a href="#crockford2006b">Crockford, 2006b</a>). Since 2006 JSON is an Internet standard, formalized by RFC4627 (<a href="#crockford2006a">Crockford, 2006a</a>).

<a name="note5" href="#noteorig5"><sup>5</sup></a> In MARC records, indicators are two single-digit positions located between the tag number and the field content, which have different uses depending on the tag number.

<a name="note6" href="#noteorig6"><sup>6</sup></a> ISIS datasets sometimes contain repeated subfields due to typos and processing errors. However, the ISIS Formatting Language, used to generate indexes and displays, is only capable of extracting the first occurrence of a subfield.

<a name="note7" href="#noteorig7"><sup>7</sup></a> From the XML standard (<a href="#w3c1998">W3C, 1998</a>): “An element type has mixed content when elements of that type may contain character data, optionally interspersed with child elements.” Mixed content also exists in HTML and in SGML, their common ancestor.

<a name="note8" href="#noteorig8"><sup>8</sup></a> Here we mean “durability” as in the “D” in the ACID database properties, meaning that once an insert or update is committed, it is written to disk. For a discussion on why the designers of MongoDB initially compromised on single-server durability, see “What about Durability?“ (<a href="#mongodb2010">MongoDB, 2010</a>). Update: as of March 16, 2011, MongoDB 1.8 is released and includes write-ahead journaling. “With journaling enabled, crash recovery is fast and safe” (<a href="#mongodb2011">MongoDB, 2011</a>).

<a name="note9" href="#noteorig9"><sup>9</sup></a> “An object is an unordered collection of zero or more name/value pairs” (<a href="#crockford2006a">Crockford, 2006a</a>).

<a name="note10" href="#noteorig10"><sup>10</sup></a> Association lists should not to be confused with associative arrays, such as those in PHP. A PHP associative array is like a hash in Perl or Ruby, a Python dictionary or a JavaScript or JSON object. In contrast, an association list, or <em>alist</em>, is not a primitive type in those languages, but can be built as an array or list of key-value pairs, where each pair is also an array or list. In Python, a more readable and convenient representation would be a list of tuples: <code>[(key1, value1), (key2, value2), ...]</code>. Association lists originated in Lisp and are common in Scheme (<a href="#mit2010">MIT, 2010</a>).

<a name="note11" href="#noteorig11"><sup>11</sup></a> As far as we know, apart from HTTP, the only other way to interact with CouchDB is via Hovercraft, an Erlang library which provides direct access to the database.

<a name="note12" href="#noteorig12"><sup>12</sup></a> CouchDB indexes are generated by “view servers” which can be written in any language. An Erlang view server is bundled with recent versions of CouchDB, but is not enabled by default. Python can be easily configured for that purpose, and Java is also known to be used. A list of view server implementations can be found at <a href="http://wiki.apache.org/couchdb/View_server" >http://wiki.apache.org/couchdb/View_server</a>.

<a name="note13" href="#noteorig13"><sup>13</sup></a> The numbers in the first column (98, 10, 38 in the example) identify a specific index, usually coinciding with the indexed tag number, but not necessarily. Some versions of ISIS support advanced querying using those numbers, others do not.

<h2>References</h2>

<a name="abiteboul1999"></a>ABITEBOUL, Serge; BUNEMAN, Peter; SUCIU, Dan. Data on the Web: From Relations to Semistructured Data and XML. San Francisco: Morgan Kaufmann, 1999.

<a name="anderson2009"></a>ANDERSON, C.; CHESNEAU, B. CouchApp: Standalone CouchDB Application Development Made Simple, 2009 [cited 2011 Mar. 26]. Available from: <a href="https://github.com/couchapp/couchapp/" >https://github.com/couchapp/couchapp/</a>.

<a name="apache2011a"></a>APACHE FOUNDATION. The Apache Cassandra Project [cited 2011 Mar. 26]. Available from: <a href="http://cassandra.apache.org/">http://cassandra.apache.org/</a>.

<a name="apache2011b"></a>APACHE FOUNDATION. The Apache CouchDB Project [cited 2011 Mar. 26]. Available from: <a href="http://couchdb.apache.org/">http://couchdb.apache.org/</a>.

<a name="avram1975"></a>AVRAM, Henriette D. MARC: Its History and Implications. Washington, DC: U.S. Government Printing Office, 1975.

<a name="barbieri2011"></a>BARBIERI, Heitor. Bruma (Java library source code). São Paulo: BIREME/OPAS/OMS, 2010 [cited 2011 Mar. 26]. Available from: <a href="http://reddes.bvsalud.org/projects/isisnbp/browser/Bruma" >http://reddes.bvsalud.org/projects/isisnbp/browser/Bruma</a>.

<a name="bireme2008"></a>BIREME/OPAS/OMS. Diccionario de Datos del Modelo LILACS Versión 1.6a. São Paulo: BIREME/OPAS/OMS, 2008 [cited 2011 Mar. 26]. Available from: <a href="http://bvsmodelo.bvs.br/download/lilacs/LILACS-5-DicionarioDados-es.pdf" >http://bvsmodelo.bvs.br/download/lilacs/LILACS-5-DicionarioDados-es.pdf</a>

<a name="bireme2010"></a>BIREME/OPAS/OMS. ISIS-JSON types. São Paulo: BIREME/OPAS/OMS, 2010 [cited 2011 Mar. 21]. Available from: <a href="http://reddes.bvsalud.org/projects/isisnbp/wiki/ISIS-JSON_types" >http://reddes.bvsalud.org/projects/isisnbp/wiki/ISIS-JSON_types</a>.

<a name="bireme2010b"></a>BIREME/OPAS/OMS. ISIS-DM: The ISIS Data Model API. São Paulo: BIREME/OPAS/OMS, 2010 [cited 2011 Mar. 29]. Available from: <a href="http://github.com/bireme/isisdm">http://github.com/bireme/isisdm</a>.

<a name="chang2006"></a>CHANG, Fay; DEAN, Jeffrey; GHEMEWAT, Sanjay; et al. Bigtable: A Distributed Storage System for Structured Data. In: OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, 2006 [cited 2011 Mar. 21]. Available from: <a href="http://labs.google.com/papers/bigtable.html" >http://labs.google.com/papers/bigtable.html</a>.

<a name="codd1970"></a>CODD, E. F. A Relational Model of Data for Large Shared Data Banks. Communications of the ACM, v. 13, n. 6, p. 377-387, jun. 1970 [cited 2011 Mar. 26]. Available from: <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.98.5286&rep=rep1&type=pdf" >http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.98.5286&rep=rep1&type=pdf</a>.

<a name="crockford2006a"></a>CROCKFORD, D. RFC4627: The application/json Media Type for JavaScript Object Notation (JSON). The Internet Society, 2006 [cited 2011 Mar. 21]. Available from: <a href="http://tools.ietf.org/html/rfc4627" >http://tools.ietf.org/html/rfc4627</a>.

<a name="crockford2006b"></a>CROCKFORD, D. JSON: The Fat-Free Alternative to XML. JSON.ORG, 2006 [cited 2011 Mar. 21]. Available from: <a src="http://www.json.org/fatfree.html"> http://www.json.org/fatfree.html</a>.

<a name="date2005"></a>DATE, C. J. Database in Depth: Relational Theory for Practitioners. Sebastopol: O'Reilly Media, 2005.

<a name="decandia2007"></a>DECANDIA, Giuseppe; HASTORUN, Deniz; JAMPANI, Madan; et al. Dynamo: Amazon's Highly Available Key-Value Store. In: Proceedings of the 21st ACM Symposium on Operating Systems Principles, Stevenson, WA, 2007.

<a name="dirolf2010"></a>DIROLF, M. Bin­ary JSON. BSONSPEC.ORG, 2010 [cited 2011 Mar. 21]. Available from: <a src="http://bsonspec.org/#/specification" >http://bsonspec.org/#/specification</a>.

<a name="elmasri2006"></a>ELMASRI, R; NAVATHE, S. B. Fundamentals of Database Systems. 5<sup>th</sup> ed. Reading, MA: Addison-Wesley, 2006.

<a name="hopkinson2005"></a>HOPKINSON, Alan. CDS/ISIS: The second decade. Information Development. February 2005 vol. 21 no. 1 p.31-37 [cited 2011 Mar. 26]. Available from: <a href="http://eprints.mdx.ac.uk/2700/3/isisseconddecade.pdf" >http://eprints.mdx.ac.uk/2700/3/isisseconddecade.pdf</a>.

<a name="jquery2011"></a>JQUERY PROJECT. QUnit, 2011. [cited 2011 Mar. 21]. Available from: <a src="http://docs.jquery.com/Qunit" >http://docs.jquery.com/Qunit</a>

<a name="lopes2010"></a>LOPES, Francisco. Historia_ISIS.doc (internal memo). São Paulo: BIREME/OPAS/OMS, 2010.

<a name="merriman2010"></a>MERRIMAN, Dwight. Comparing Mongo DB and Couch DB. MongoDB.org. [cited 2011 Mar. 21]. Available from: <a src="http://www.mongodb.org/display/DOCS/Comparing+Mongo+DB+and+Couch+DB" >http://www.mongodb.org/display/DOCS/Comparing+Mongo+DB+and+Couch+DB">.

<a name="mit2010"></a>MIT. MIT/GNU Scheme - Association Lists. Cambridge, MA: Massachusetts Institute of Technology, 2010 [cited 2011 Mar. 22]. Available from: <a href="http://web.mit.edu/scheme_v9.0.1/doc/mit-scheme-ref/Association-Lists.html" >http://web.mit.edu/scheme_v9.0.1/doc/mit-scheme-ref/Association-Lists.html</a>.

<a name="mongodb2010"></a>MONGODB.ORG. What about Durability? MongoDB Blog, 2010 [cited 2011 Mar. 22]. Available from: <a href="http://blog.mongodb.org/post/381927266/what-about-durability" >http://blog.mongodb.org/post/381927266/what-about-durability</a>.

<a name="mongodb2011"></a>MONGODB.ORG. MongoDB 1.8 Released. MongoDB Blog, 2011 [cited 2011 Mar. 22]. Available from: <a href="http://blog.mongodb.org/post/3903149313/mongodb-1-8-released" >http://blog.mongodb.org/post/3903149313/mongodb-1-8-released</a>.

<a name="pylons2011"></a>PYLONS PROJECT. Deform. Agendaless Consulting, 2011 [cited 2011 Mar. 22]. Available from: <a "http://docs.pylonsproject.org/projects/deform/dev/#" >http://docs.pylonsproject.org/projects/deform/dev/</a>.

<a name="ramalho2011"></a>RAMALHO, L. LILACS on CouchDB, 2011 [cited 2011 Mar. 26]. Available from: <a href="https://bitbucket.org/ramalho/lilcouch" >https://bitbucket.org/ramalho/lilcouch</a>

<a name="silberschatz2006"></a>SILBERSCHATZ, A.; KORTH, H.; Sudarshan, S. Database System Concepts, 5<sup>th</sup> Ed. New York: McGraw-Hill, 2006.

<a name="suciu2001"></a>SUCIU, Dan. Managing XML and Semistructured Data - Lecture Series (digital slides). Seattle: Department of Computer Science & Engineering, University of Washington, 2001 [cited 2011 Mar. 21]. Available from: <a href="http://www.cs.washington.edu/homes/suciu/COURSES/590DS/" >http://www.cs.washington.edu/homes/suciu/COURSES/590DS/</a>

<a name="suciu2009"></a>SUCIU, Dan. Semi-Structured Data Model. In: LIU, L.; ÖZSU, M. T. Encyclopedia of Database Systems: Springer, 2009. p. 2601-2605.

<a name="thomale2010"></a>THOMALE, Jason. Interpreting MARC: Where's the Bibliographic Data? Code4Lib Journal, Issue 11, 2010-09-21 [cited 2011 Mar. 26]. Available from: <a href="http://journal.code4lib.org/articles/3832" >http://journal.code4lib.org/articles/3832</a>.

<a name="tok2005"></a>TOK, Wang Ling; LEE, Mong Li; DOBBIE, Gillian. Semistructured Database Design. Boston: Springer Science, 2005.

<a name="w3c1998"></a>W3C. Extensible Markup Language (XML) 1.0 (Fifth Edition), 2008. [cited 2011 Mar. 26]. Available from: <a href="http://www.w3.org/TR/REC-xml/#sec-mixed-content" >http://www.w3.org/TR/REC-xml/#sec-mixed-content</a>

<h2 class="abouttheauthor">About the Author</h2>

<a href="mailto:luciano.ramalho@bireme.org">Luciano Ramalho</a> was designing large-scale Web publishing systems before the Netscape IPO and the first release of Internet Explorer. He has a B.S. in Library Sciences from the University of São Paulo, and is a software development supervisor at BIREME/PAHO/WHO, a digital library that is part of the Knowledge Management and Communication area of the Pan American Health Organization. His English-language blog is at <a href="http://standupprogrammer.blogspot.com" >http://standupprogrammer.blogspot.com</a>.