Source

papers / code4lib13 / isis2couchdb.txt

Full commit
   1
   2
   3
   4
   5
   6
   7
   8
   9
  10
  11
  12
  13
  14
  15
  16
  17
  18
  19
  20
  21
  22
  23
  24
  25
  26
  27
  28
  29
  30
  31
  32
  33
  34
  35
  36
  37
  38
  39
  40
  41
  42
  43
  44
  45
  46
  47
  48
  49
  50
  51
  52
  53
  54
  55
  56
  57
  58
  59
  60
  61
  62
  63
  64
  65
  66
  67
  68
  69
  70
  71
  72
  73
  74
  75
  76
  77
  78
  79
  80
  81
  82
  83
  84
  85
  86
  87
  88
  89
  90
  91
  92
  93
  94
  95
  96
  97
  98
  99
 100
 101
 102
 103
 104
 105
 106
 107
 108
 109
 110
 111
 112
 113
 114
 115
 116
 117
 118
 119
 120
 121
 122
 123
 124
 125
 126
 127
 128
 129
 130
 131
 132
 133
 134
 135
 136
 137
 138
 139
 140
 141
 142
 143
 144
 145
 146
 147
 148
 149
 150
 151
 152
 153
 154
 155
 156
 157
 158
 159
 160
 161
 162
 163
 164
 165
 166
 167
 168
 169
 170
 171
 172
 173
 174
 175
 176
 177
 178
 179
 180
 181
 182
 183
 184
 185
 186
 187
 188
 189
 190
 191
 192
 193
 194
 195
 196
 197
 198
 199
 200
 201
 202
 203
 204
 205
 206
 207
 208
 209
 210
 211
 212
 213
 214
 215
 216
 217
 218
 219
 220
 221
 222
 223
 224
 225
 226
 227
 228
 229
 230
 231
 232
 233
 234
 235
 236
 237
 238
 239
 240
 241
 242
 243
 244
 245
 246
 247
 248
 249
 250
 251
 252
 253
 254
 255
 256
 257
 258
 259
 260
 261
 262
 263
 264
 265
 266
 267
 268
 269
 270
 271
 272
 273
 274
 275
 276
 277
 278
 279
 280
 281
 282
 283
 284
 285
 286
 287
 288
 289
 290
 291
 292
 293
 294
 295
 296
 297
 298
 299
 300
 301
 302
 303
 304
 305
 306
 307
 308
 309
 310
 311
 312
 313
 314
 315
 316
 317
 318
 319
 320
 321
 322
 323
 324
 325
 326
 327
 328
 329
 330
 331
 332
 333
 334
 335
 336
 337
 338
 339
 340
 341
 342
 343
 344
 345
 346
 347
 348
 349
 350
 351
 352
 353
 354
 355
 356
 357
 358
 359
 360
 361
 362
 363
 364
 365
 366
 367
 368
 369
 370
 371
 372
 373
 374
 375
 376
 377
 378
 379
 380
 381
 382
 383
 384
 385
 386
 387
 388
 389
 390
 391
 392
 393
 394
 395
 396
 397
 398
 399
 400
 401
 402
 403
 404
 405
 406
 407
 408
 409
 410
 411
 412
 413
 414
 415
 416
 417
 418
 419
 420
 421
 422
 423
 424
 425
 426
 427
 428
 429
 430
 431
 432
 433
 434
 435
 436
 437
 438
 439
 440
 441
 442
 443
 444
 445
 446
 447
 448
 449
 450
 451
 452
 453
 454
 455
 456
 457
 458
 459
 460
 461
 462
 463
 464
 465
 466
 467
 468
 469
 470
 471
 472
 473
 474
 475
 476
 477
 478
 479
 480
 481
 482
 483
 484
 485
 486
 487
 488
 489
 490
 491
 492
 493
 494
 495
 496
 497
 498
 499
 500
 501
 502
 503
 504
 505
 506
 507
 508
 509
 510
 511
 512
 513
 514
 515
 516
 517
 518
 519
 520
 521
 522
 523
 524
 525
 526
 527
 528
 529
 530
 531
 532
 533
 534
 535
 536
 537
 538
 539
 540
 541
 542
 543
 544
 545
 546
 547
 548
 549
 550
 551
 552
 553
 554
 555
 556
 557
 558
 559
 560
 561
 562
 563
 564
 565
 566
 567
 568
 569
 570
 571
 572
 573
 574
 575
 576
 577
 578
 579
 580
 581
 582
 583
 584
 585
 586
 587
 588
 589
 590
 591
 592
 593
 594
 595
 596
 597
 598
 599
 600
 601
 602
 603
 604
 605
 606
 607
 608
 609
 610
 611
 612
 613
 614
 615
 616
 617
 618
 619
 620
 621
 622
 623
 624
 625
 626
 627
 628
 629
 630
 631
 632
 633
 634
 635
 636
 637
 638
 639
 640
 641
 642
 643
 644
 645
 646
 647
 648
 649
 650
 651
 652
 653
 654
 655
 656
 657
 658
 659
 660
 661
 662
 663
 664
 665
 666
 667
 668
 669
 670
 671
 672
 673
 674
 675
 676
 677
 678
 679
 680
 681
 682
 683
 684
 685
 686
 687
 688
 689
 690
 691
 692
 693
 694
 695
 696
 697
 698
 699
 700
 701
 702
 703
 704
 705
 706
 707
 708
 709
 710
 711
 712
 713
 714
 715
 716
 717
 718
 719
 720
 721
 722
 723
 724
 725
 726
 727
 728
 729
 730
 731
 732
 733
 734
 735
 736
 737
 738
 739
 740
 741
 742
 743
 744
 745
 746
 747
 748
 749
 750
 751
 752
 753
 754
 755
 756
 757
 758
 759
 760
 761
 762
 763
 764
 765
 766
 767
 768
 769
 770
 771
 772
 773
 774
 775
 776
 777
 778
 779
 780
 781
 782
 783
 784
 785
 786
 787
 788
 789
 790
 791
 792
 793
 794
 795
 796
 797
 798
 799
 800
 801
 802
 803
 804
 805
 806
 807
 808
 809
 810
 811
 812
 813
 814
 815
 816
 817
 818
 819
 820
 821
 822
 823
 824
 825
 826
 827
 828
 829
 830
 831
 832
 833
 834
 835
 836
 837
 838
 839
 840
 841
 842
 843
 844
 845
 846
 847
 848
 849
 850
 851
 852
 853
 854
 855
 856
 857
 858
 859
 860
 861
 862
 863
 864
 865
 866
 867
 868
 869
 870
 871
 872
 873
 874
 875
 876
 877
 878
 879
 880
 881
 882
 883
 884
 885
 886
 887
 888
 889
 890
 891
 892
 893
 894
 895
 896
 897
 898
 899
 900
 901
 902
 903
 904
 905
 906
 907
 908
 909
 910
 911
 912
 913
 914
 915
 916
 917
 918
 919
 920
 921
 922
 923
 924
 925
 926
 927
 928
 929
 930
 931
 932
 933
 934
 935
 936
 937
 938
 939
 940
 941
 942
 943
 944
 945
 946
 947
 948
 949
 950
 951
 952
 953
 954
 955
 956
 957
 958
 959
 960
 961
 962
 963
 964
 965
 966
 967
 968
 969
 970
 971
 972
 973
 974
 975
 976
 977
 978
 979
 980
 981
 982
 983
 984
 985
 986
 987
 988
 989
 990
 991
 992
 993
 994
 995
 996
 997
 998
 999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
1400
1401
1402
1403
1404
1405
1406
1407
1408
1409
1410
1411
1412
1413
1414
1415
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425
1426
1427
1428
1429
1430
1431
1432
1433
by Luciano G. Ramalho

<h2><a name="section1"></a>1. Introduction</h2>

The relational database model is well grounded in theory and well supported by
the software industry. Relational database management systems (RDBMS) replaced
legacy hierarchical and network database products in nearly every enterprise and
became the mainstream. Object-oriented DBMS have not taken the leading role
from relational systems, even after object-oriented programming became the norm,
and in spite of the so-called “impedance mismatch” perceived when one needs
to store nested objects into a collection of flat tables.

After resisting the object-oriented wave, relational databases became so
dominant that they seemed like the only game in town, or at least the only game
in most shops. But in the last five years, Google published a paper on BigTable
(<a href="#chang2006">Chang, 2006</a>), the proprietary, distributed
non-relational database behind many of its online properties. Then Amazon.com
revealed Dynamo, the non-relational database that powers some of their AWS
cloud-based services (<a href="#decandia2007">Decandia, 2007</a>). Facebook
open-sourced Cassandra, the distributed non-relational database created to
enable its crucial in-box search feature. Cassandra is now also used by Twitter,
Digg, Rackspace and Cisco, among many others, and has such a vibrant community
that it became a top-level project of the Apache Foundation in 2010, less than
three years after its initial public release (<a href="#apache2011a">Apache,
2011a</a>). These developments are part of a trend that became known, in 2009,
as the “NoSQL" movement. NoSQL stands for “No SQL” or “Not Only SQL”, depending
on who you ask; but everyone agrees it's not really about the SQL language, but
about seeking alternative answers to problems that are not amenable to
relational solutions.

Meanwhile, thousands of small to medium libraries in developing countries have
been oblivious to all this and continue using their ISIS databases for daily
operations. ISIS is a family of non-relational database systems with a history
that goes back to the 1970's. It was developed by UNESCO – the United Nations
Educational, Scientific and Cultural Organization – specifically for
bibliographic data, and it is still officially distributed and widely used (<a
href="#hopkinson2005">Hopkinson, 2005</a>).

ISIS had a very positive impact, allowing libraries with limited resources to
computerize their catalogs on simple, stand-alone PCs. Some ISIS products became
Web-enabled, but after that, progress has been very slow in the last 10 years.
Meanwhile, a couple of new Open Source NoSQL databases present a possible
migration path for ISIS users, a path that will not require wholesale
normalization of decades of bibliographic data. This paper describes how
BIREME/PAHO/WHO, a digital library that is part of the Pan American Health
Organization, and a large-scale user of ISIS databases, has studied the
viability of migrating one of its main databases to Apache CouchDB, a modern
document database.

The rest of the paper is structured as follows. <a href="#section2">Section
2</a> discusses data models for bibliographic records, explains why it is
tempting for librarians to look beyond the relational model and compares the
structure of MARC records to the semistructured data model formalized in the
1990's. <a href="#section3">Section 3</a> briefly presents ISIS, then compares
two recent products that support semistructured records, CouchDB and MongoDB,
and shows our criteria for choosing among them. <a href="#section4">Section
4</a> discusses different formats for representing ISIS records in CouchDB, then
describes the tool we developed for converting records and the process used to
load them into CouchDB. <a href="#section5">Section 5</a> talks about how
indexing works in ISIS and CouchDB, and exemplifies queries on the latter.
Finally, <a href="#section6">Section 6</a> summarizes what we have done, our
conclusions and ongoing work.

<h2><a name="section2"></a>2. Data models for bibliographic records</h2>

One of the defining characteristics of the relational data model, the First
Normal Form (1NF), dictates that attributes must be atomic. In other words, in
the industry-standard flat relational model<sup><a name="noteorig1"
href="#note1">1</a></sup>, fields cannot be structured into subfields nor
contain multiple values.

On the other hand, multivalued fields are useful and common for representing
bibliographic data. Books often have more than one author and cover multiple
subjects. A publisher, while logically a single attribute of a book record, has
attributes of its own, like a name and a place. It is therefore useful to split
a publisher field into parts. That is why MARC supports subfields and
multivalued fields through repeating tags (similar to XML). The MARC structure,
or “empty container”<sup><a name="noteorig2" href="#note2">2</a></sup> carried
over to the ISO-2709 standard, which describes a generic data interchange format
allowing applications to attach any meaning to the tags and subfield markers.
Here we will refer to the “ISO-2709 data model” as a generalization of the data
model implied by the MARC record structure.

Of course, library data is often stored in relational databases, but
normalization means that the description of a single item is spread over
multiple records in several tables. That may work well within the context of one
library, but when exchanging data, it is more practical to have just one record
per item. That is one reason why MARC is still with us, and why XML became so
important in a world dominated by relational databases.

If the ISO-2709 data model does not adhere to the flat relational model, then
what is the theory behind it? At first there was none, but in the mid-1990s,
while XML was still under development, researchers from UPenn, Stanford, AT&T
Labs and the INRIA French R&D agency conceived the semistructured<sup><a
name="noteorig3" href="#note3">3</a></sup> data model (<a
href="#abiteboul1999">Abiteboul, 1999</a>). Their work supports reasoning about
MARC and ISO-2709 records but also about the richer data models of XML and
JSON<sup><a name="noteorig4" href="#note4">4</a></sup> documents.

A useful definition by Dan Suciu, a pioneer of semistructured data research,
appears in the “Encyclopedia of Database Systems” (<a href="#suciu2009">Suciu,
2009</a>):

<blockquote>
The semi-structured data model is designed as an evolution of the relational
data model that allows the representation of data with a flexible structure.
Some items may have missing attributes, others may have extra attributes,
some items may have two or more occurrences of the same attribute. The type
of an attribute is also flexible: it may be an atomic value or it may be
another record or collection. Moreover, collections may be heterogeneous,
i.e., they may contain items with different structures. The semi-structured
data model is a self-describing data model, in which the data values and the
schema components co-exist.
</blockquote>

Thanks to the semistructured data model, we don't need to feel ashamed of our
denormalized bibliographic records any more. Even better, we have a framework to
evaluate databases of the current NoSQL crop with regard to our needs as
caretakers of denormalized datasets.

The book “Data on the Web: From Relations to Semistructured Data and XML”
(<a href="#abiteboul1999">Abiteboul, 1999</a>), introduces a notation for
representing semistructured data. It is called “ssd-expression” and it looks
like this:

<pre>
  {name: {first: "Lewis", last: "Carroll"},
   tel: 5553457,
   email: "cld@ox.ac.uk",
   email: "lcarroll@pobox.com"
  }
</pre>

Except for the repetition of the <code>email</code> key, it's very similar to
JSON (<a href="#crockford2006">Crockford 2006</a>). The presence of keys
describing fields like <code>name</code>, <code>first</code> etc., and the
storage of those keys alongside the data, is what the encyclopedia entry means
by coexisting data values and schema components (<a href="#suciu2009">Suciu,
2009</a>). It is a characteristic shared by ISO-2709, XML, and JSON (with the
limitation that ISO-2709 fields are identified by tags composed of three
alphanumeric ASCII characters, though most applications only use numeric tags).
The repeated <code>email</code> tag could be represented similarly in XML and in
ISO-2709, but in JSON the keys must be unique within an object structure, so
multivalued fields are often represented as lists of values:

<pre>[sourcecode language='javascript']
  {"name": {"first": "Lewis", "last": "Carroll"},
   "tel": 5553457,
   "email": ["cld@ox.ac.uk", "lcarroll@pobox.com"]
  }
[/sourcecode]</pre>

The ISO-2709 record format influenced the record structure of the ISIS family of
databases, introduced by UNESCO in the 1970s. ISIS records follow the structure
of the ISO-2709 standard, and most ISIS systems import and export to that
format. Here is the same record of the examples above, as seen in a common ISIS
display format:

<pre>
  10 «^fLewis ^lCarroll»
  20 «5553457»
  30 «cld@ox.ac.uk»
  30 «lcarroll@pobox.com»
</pre>

In an ISIS record the field names are replaced by numeric tags. The «»
delimiters are not part of the content, but a feature of the display format.
Here there are two occurrences of tag #30, the e-mail field. Field #10 is split
into two subfields ("f" and "l") by special markers. Instead of the subfield
delimiter control character used in MARC systems, ISIS uses the circumflex
accent: ^ (ASCII 94). A subfield identifier is case-insensitive and consists of
just one ASCII letter or digit. The ISIS data model is simpler than that of
ISO-2709 because it does not have field indicators<sup><a
name="noteorig5" href="#note5">5</a></sup> and subfields are not
repeatable<sup><a name="noteorig6" href="#note6">6</a></sup>.

Clearly the ISIS data model is not as flexible as the general semistructured
data model. Given the syntax of subfield markers, ISIS fields are limited to one
level of nesting only. In other words, borrowing from the XML jargon: subfields
cannot have child elements, just character data. Also, an ISIS field may have
mixed content<sup><a name="noteorig7" href="#note7">7</a></sup>: some unmarked
text may appear before the first subfield marker. For example, in this field:

<pre>
  18 «Lords of Finance^bThe Bankers Who Broke the World»
</pre>

The main title, "Lords of Finance" is not preceded by a subfield marker. The
semistructured data model, as described by Abiteboul et. al., has no concept of
mixed content like XML has. So we will call that part “the main subfield” and
pretend it is preceded by an implied "^_" (underscore) subfield marker whenever
we need to refer to it in code. With this simple arrangement, any ISIS record
can be represented as an ssd-expression or a JSON record.

<h2><a name="section3"></a>3. Semistructured database systems</h2>

<h3>3.1. The ISIS family</h3>

In 1985, Giampaolo Del Bigio ported UNESCO's CDS/ISIS mainframe database system
to PC/DOS (<a href="#lopes2010">Lopes, 2010</a>). The result, called MicroISIS,
was distributed free of charge and became the <i>de facto</i> standard for
computerized library catalogs in developing countries. Its Windows version,
WinISIS is still distributed by UNESCO and remains widely used in small to
medium libraries (<a href="#hopkinson2005">Hopkinson, 2005</a>). Beyond local
OPACs, ISIS is also used by two large regional bibliographic databases in Latin
America and the Caribbean: SciELO (Scientific Electronic Library Online) and
LILACS (Latin American and Caribbean Health Sciences index), both built by
BIREME/PAHO/WHO, which is a specialized center of the Pan American Health
Organization located in São Paulo, Brazil. Together, SciELO and LILACS routinely
handle millions of bibliographic records using CISIS, a custom version of ISIS
developed by BIREME/PAHO/WHO. MicroISIS, WinISIS, CISIS and the more recent
J-ISIS from UNESCO are all interoperable and form the ISIS family of database
systems.

The legacy ISIS codebases are showing their age, and after years of
closed-source development, none of the derivations have managed to become
successful Open Source projects. Meanwhile, the NoSQL movement has highlighted
some Open Source non-relational databases implementing variations of the
semistructured data model. This has motivated BIREME/PAHO/WHO to look for
alternatives.

<h3>3.2 CouchDB and MongoDB</h3>

Among the recent crop of NoSQL systems, two products stand out for implementing
the semistructured data model in a way that matches the operational needs of
BIREME/PAHO/WHO: Apache CouchDB and MongoDB (<a href="#apache2011b">Apache,
2011b</a>, <a href="#mongodb2011">MongoDB.org, 2011</a>). Their data model is
not as flat as the key-value stores (optimized for retrieving blobs given a
primary key), nor as deep as graph databases (designed to allow general queries
over paths of nested objects). What they offer is somewhere in between:
JSON-like records allowing nested structures, and expressive query languages to
index and retrieve those records. Those rich records are called “documents”.
CouchDB and MongoDB call themselves “document databases”.

In the case of CouchDB, the document format is JSON. MongoDB uses BSON, a binary
format inspired by JSON but offering more data types, such as int32, datetime,
and even a type called “JavaScript code w/ scope” (<a href="#dirolf2010">Dirolf,
2010</a>). JSON can be trivially converted to BSON, but the reverse may not be
so easy, depending on the data. While the difference between those formats can
be considered an implementation detail, they reflect different priorities.

MongoDB is optimized for fast updating, by aggressively caching writes and by
overwriting updated records in place whenever possible: the BSON structure is
designed to allow updating specific fields of an existing document. The
flip-side of this optimization is that a software crash can cause data loss, and
a minimum deployment of two servers in master-slave configuration is recommended
if durability is required<sup><a name="noteorig8" href="#note8">8</a></sup>. In
contrast, CouchDB is fault-tolerant: it only appends to the database file on
disk, and that is an atomic operation in modern operating systems. As a result,
the database file is always consistent in the event of a software crash, and
backups can be made while the system is running. The drawbacks of the
append-only design are the need for periodic database compaction – a
time-consuming batch operation – and slower updates.

CouchDB was influenced by Lotus Notes, a networked, collaborative application
suite designed to support users who are often off-line. So, CouchDB allows
master-master replication, that is, synchronization between peer nodes which
have received inserts and updates independently. MongoDB supports only
master-slave replication: only one node receives updates and inserts, and
replicates to the slaves.

While CouchDB can be used in large clusters, it is also well-suited as a
database for small desktop apps. For that reason it comes pre-installed on
Ubuntu Linux since 2009. It even runs on Android mobile devices (a free,
one-click installer can be found in the Android Marketplace). Because CouchDB
supports HTTP and JSON natively, and can run JavaScript procedures on the
server, applications can be developed without any middleware: browsers
communicate directly with the database, and all the logic is written in
JavaScript.

The informative post “Comparing Mongo DB and Couch DB” was written by Dwight
Merriman, CEO of 10gen, the company behind MongoDB (<a
href="#merriman2010">Merriman, 2010</a>). Given the characteristics of these
systems, we decided to use CouchDB for our LILACS experiments for these reasons:

<ul>
<li>Easier deployment in a durable configuration.</li>
<li>Support for master-master replication, useful for distributed cooperative
  cataloging applications.</li>
<li>Direct support for JSON over HTTP, enabling Web Services and AJAX
  applications without middleware.</li>
</ul>

However, we do envision scenarios in which we would like to use both MongoDB and
CouchDB. For example, the canonical store of bibliographic data could be
CouchDB, while a cluster of MongoDB instances could be used for user-provided
content, tracking, recommendations and other features demanding faster database
writes.

<h2><a name="section4"></a>4. Loading ISIS records into CouchDB</h2>

<h3>4.1. Alternative representations of ISIS records as JSON documents</h3>

The ISIS data model, a subset of the ISO-2709 model, is not as expressive as the
JSON data model. There are several ways to represent the same ISIS record in
JSON. Consider this abridged example of a LILACS record:

<pre>
   1 «BR1.1»
   2 «538886»
   4 «LILACS»
   4 «LLXPEDT»
   5 «S»
   6 «as»
   8 «Internet^ihttp://…/imageBank/PDF/v3n3a04.pdf?aid2=168&amp;…»
  10 «Kanda, Paulo Afonso de Medeiros^1University of São Paulo
      ^2School of Medicine^3Cognitive Disorders of Clinicas
      Hospital Reference Center^pBrasil ^cSão Paulo^rorg»
  10 «Anghinah, Renato^1University of São Paulo^2School of
      Medicine ^3Cognitive Disorders of Clinicas Hospital
      Reference Center ^pBrasil^cSão Paulo^rorg»
  12 «The Clinical use of quantitative EEG in cognitive disorders
      ^ien»
  12 «A utilização clínica do EEG quantitativo nos transtornos
      cognitivos^ipt»
  30 «Dement. neuropsychol»
  31 «3»
  32 «3»
  35 «1980-5764»
</pre>

Tag #2 is the LILACS identifier, a non-repeating, numeric field, and tag #10 is
an author field in an analytic (article-level), record. The LILACS data
dictionary (<a href="#bireme2008">BIREME, 2008</a>) defines #10 as a repeating
field. In addition, it is composed of the following subfields:

<table>
<tr>
	<th>subfield</th>
	<th>description</th>
	<th>sample content</th>
</tr>
<tr>
	<th>main</th>
	<td>author name</td>
	<td>Kanda, Paulo Afonso de Medeiros</td>
</tr>
<tr>
	<th>^1</th>
	<td>affiliation level 1</td>
	<td>University of São Paulo</td>
</tr>
<tr>
	<th>^2</th>
	<td>affiliation level 2</td>
	<td>School of Medicine</td>
</tr>
<tr>
	<th>^3</th>
	<td>affiliation level 3</td>
	<td>Cognitive Disorders of Clinicas Hospital Reference Center</td>
</tr>
<tr>
	<th>^p</th>
	<td>country</td>
	<td>Brasil</td>
</tr>
<tr>
	<th>^c</th>
	<td>city</td>
	<td>São Paulo</td>
</tr>
<tr>
	<th>^r</th>
	<td>degree of responsibility</td>
	<td>org (organizer)</td>
</tr>
</table>

One way to represent such a record in JSON is shown below (only fields #10 and
#2 are shown; some contents shortened for clarity):

<pre>[sourcecode language='javascript']
{
    "10": [
        {
            "_": "Kanda, Paulo Afonso",
            "1": "University of São Paulo",
            "2": "School of Medicine",
            "3": "Cognitive Disorders of … Reference Center",
            "c": "São Paulo",
            "p": "Brasil",
            "r": "org"
        },
        {
            "_": "Smidth, Magali Taino",
            "1": "University of São Paulo",
            "2": "School of Medicine",
            "3": "Cognitive Disorders of … Reference Center",
            "c": "São Paulo",
            "p": "Brasil",
            "r": "org"
        }
    ],
    "2": [
        {
            "_": "538886"
        }
    ]
}
[/sourcecode]</pre>

In the representation above, the record is a JSON mapping between tags (eg.
“10”) and lists of values (or occurrences in ISIS jargon). The use of lists
allows any field to be repeatable, therefore any possible ISIS record fits this
scheme. Also, within the list of occurrences, each one is represented by a
mapping of subfield keys and values. The special key “_” (underscore) is used to
denote the main subfield, and when there are no subfields that is the only key.
This creates unnecessary nesting, as demonstrated by field #2 which is the
LILACS identifier field, non-repeating and devoid of subfields as defined by the
LILACS data dictionary. But it does make the structure homogeneous, an important
feature when dealing with legacy semistructured datasets where not all records
adhere to the current schema. During our research we called the representation
above ISIS-JSON type 3 (<a href="#bireme2010">BIREME, 2010</a>).

The main drawback of ISIS-JSON type 3 is that, by definition, JSON mappings are
unordered<sup><a name="noteorig9" href="#note9">9</a></sup>. Therefore, neither
the tag order within the record nor the subfield order within a field occurrence
are preserved. In the LILACS database, tag order within a record is only
relevant when there are repeating tags, such as #10 (author, analytical level).
Fortunately, keeping repeated occurrences in a list does preserve their relative
ordering, even if the overall field order is not kept. For example, in the
original ISIS record under analysis field #2 appeared before all occurrences of
field #10, while in the JSON representation the key “10” precedes key “2”. This
is irrelevant in practice. But, crucially, the order of the authors is the same
in both formats: first “Kanda, Paulo Afonso”, then “Smidth, Magali Taino”.

Regarding subfield ordering within a field, the LILACS data dictionary does
establish a canonical ordering. For example, in the case of field #10 the order
is _, 1, 2, 3, p, c, r (the main subfield is always first, by definition).
However, as Jason Thomale points out in “Interpreting MARC: Where’s the
Bibliographic Data?” (<a href="#thomale2010">Thomale, 2010</a>), subfield
markers should be interpreted as textual markup, and their ordering is
significant. Particularly when cleaning up or converting legacy data,
maintaining the order of subfields from the original record may be important to
ascertain the original cataloger's intention, even if LILACS subfields do have a
prescribed ordering.

To preserve subfield ordering, two alternative JSON representations have been
tried at BIREME/PAHO/WHO. The first is ISIS-JSON type 1 (line breaks added
within long strings for clarity):

<pre>[sourcecode language='javascript']
{
    "10": [
        "Kanda, Paulo Afonso de Medeiros^1University of São Paulo
          ^2School of Medicine^3Cognitive Disorders of Clinicas
          Hospital Reference Center^pBrasil ^cSão Paulo^rorg",
        "Smidth, Magali Taino^1University of São Paulo ^2School
          of Medicine^3Cognitive Disorders of Clinicas Hospital
          Reference Center^pBrasil ^cSão Paulo^rorg"
    ],
    "2": ["538886"]
}
[/sourcecode]</pre>

In this format, each field occurrence is a single string, with the subfield
delimiters embedded. This essentially “punts” on the issue of how to represent
subfields, leaving the parsing to be done by the database. Because CouchDB uses
JavaScript as its default language for creating indexes, it is a viable
approach. Another possibility is ISIS-JSON type 2:

<pre>[sourcecode language='javascript']
{
    "10": [
        [
            ["_", "Kanda, Paulo Afonso"],
            ["1", "University of São Paulo"],
            ["2", "School of Medicine"],
            ["3", "Cognitive Disorders of … Reference Center"],
            ["p", "Brasil"],
            ["c", "São Paulo"],
            ["r", "org"]
        ],
        [
            ["_", "Smidth, Magali Taino"],
            ["1", "University of São Paulo"],
            ["2", "School of Medicine"],
            ["3", "Cognitive Disorders of … Reference Center"],
            ["p", "Brasil"],
            ["c", "São Paulo"],
            ["r", "org"]
        ]
    ],
    "2": [
        [
            ["_", "538886"]
        ]
    ]
}
[/sourcecode]</pre>

Here each field occurrence is an association list<sup><a
name="noteorig10" href="#note10">10</a></sup>, that is, a mapping represented as
a list of lists, where each inner list contains a key-value pair. The advantage
of an associative list over a JSON object mapping is that the order of items is
preserved, but at the cost of a linear search to locate a key within a field
occurrence. Because of how indexing works in CouchDB, this is not as costly as
it may seem, as will be seen in the next section. For MARC-style records, the
association list has the added advantage of allowing repeated mappings to
represent repeating subfields.

<h3>4.2. The isis2json conversion tool</h3>

To convert ISIS records into JSON structures, we developed a Python script
called <code>isis2json.py</code>. It can be executed with both the Python and
Jython interpreters, versions 2.5 through 2.7. When running under Python it can
read only ISO-2709 files, but as a Jython script it leverages the Bruma Java
library (<a href="#barbieri2011">Barbieri, 2011</a>) and can also read binary
ISIS files in .MST format directly. Several options control the structure of the
JSON output. For example, the command line below generates output suitable for
batch importing to CouchDB:

<pre>
$ ./isis2json.py cds.iso -c -t 3 -q 100 > cds1.json
</pre>

The arguments used in the above example are:

<code>-c</code> to generate records within a “docs” list, as required by the
CouchDB <code>_bulk_docs</code> API;

<code>-t 3</code> for ISIS-JSON type 3 export (fields as dictionaries of
subfields);

<code>-q 100</code> to output only 100 records;

Here is the help screen of isis2json.py:

<pre>
$ ./isis2json.py -h
usage: isis2json.py [-h] [-o OUTPUT.json] [-c] [-m] [-t ISIS_JSON_TYPE]
                    [-q QTY] [-s SKIP] [-i TAG_NUMBER] [-u] [-p PREFIX] [-n]
                    [-k TAG:VALUE]
                    INPUT.(mst|iso)

Convert an ISIS .mst or .iso file to a JSON array

positional arguments:
  INPUT.(mst|iso)       .mst or .iso file to read

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT.json, --out OUTPUT.json
                        the file where the JSON output should be written
                        (default: write to stdout)
  -c, --couch           output array within a "docs" item in a JSON document
                        for bulk insert to CouchDB via POST to db/_bulk_docs
  -m, --mongo           output individual records as separate JSON
                        dictionaries, one per line for bulk insert to MongoDB
                        via mongoimport utility
  -t ISIS_JSON_TYPE, --type ISIS_JSON_TYPE
                        ISIS-JSON type, sets field structure: 1=string,
                        2=alist, 3=dict
  -q QTY, --qty QTY     maximum quantity of records to read (default=ALL)
  -s SKIP, --skip SKIP  records to skip from start of .mst (default=0)
  -i TAG_NUMBER, --id TAG_NUMBER
                        generate an "_id" from the given unique TAG field
                        number for each record
  -u, --uuid            generate an "_id" with a random UUID for each record
  -p PREFIX, --prefix PREFIX
                        concatenate prefix to every numeric field tag (ex. 99
                        becomes "v99")
  -n, --mfn             generate an "_id" from the MFN of each record
                        (available only for .mst input)
  -k TAG:VALUE, --constant TAG:VALUE
                        Include a constant tag:value in every record (ex. -k
                        type:AS)
</pre>


<h3>4.3. Interacting with CouchDB</h3>

HTTP is the main protocol for interacting with CouchDB<sup><a
name="noteorig11" href="#note11">11</a></sup>. Most database administration is
done via Futon, its built-in web interface.

<div style="width: 500px; margin: auto;"><a
href="/media/issue13/ramalho/fig1-futon.png"><img class="caption"
src="/media/issue13/ramalho/fig1-futon-500x422.png"
alt="Futon screen shot: viewing a record" /></a></div> <p
class="caption"><strong>Figure 1.</strong> CouchDB Futon web interface showing a
LILACS record in ISIS-JSON Type-2 format.</p>

Nearly every task which can be done via Futon can also be automated using the
RESTful API of CouchDB via any HTTP client. In fact, interacting with CouchDB is
a great way to learn, in practice, what REST and RESTful interfaces are all
about. Here we use cURL, a command-line HTTP client available for many systems.
For example, a database is created with a simple PUT request:

<pre>
$ curl -X PUT http://admin_user:password@127.0.0.1:5984/lilacs
</pre>

One operation that cannot be done via Futon is uploading records in bulk.
Continuing the previous example, we could convert an ISO-2709 file to JSON, then
upload it to CouchDB using cURL in two steps:

<pre>
$ ./isis2json.py cds.iso -c > cds1.json
$ curl -d cds1.json -H"Content-Type: application/json" \
       -X POST http://127.0.0.1:5984/lilacs/_bulk_docs
</pre>

When used together, the <code>-q</code> and the <code>-s</code> (skip) options
allow splitting the output into several batches, which results in faster loading
for large datasets on multi-core servers. For example, loading 20,000 full
LILACS records, totaling 64MB, took 147 seconds using this shell command:

<pre>
$ isis2json.py -c -p v -i 2 -q 20000 lilacs100k.iso | \
  curl -d @- -H"Content-Type: application/json" \
       -X POST http://127.0.0.1:5984/lilacs/_bulk_docs
</pre>

Loading the same data in two 10,000-record batches took 100 seconds when taking
advantage of two cores with a simple shell script like this one:

<pre>
#!/bin/bash
isis2json.py -c -p v -i 2 -q 10000 lilacs100k.iso | \
curl -d @- -H"Content-Type: application/json" \
     -X POST http://127.0.0.1:5984/lilacs/_bulk_docs &
isis2json.py -c -p v -i 2 -q 10000 -s 10000 lilacs100k.iso | \
curl -d @- -H"Content-Type: application/json" \
     -X POST http://127.0.0.1:5984/lilacs/_bulk_docs
</pre>

If an <code>_id</code> attribute is not present in an inserted document, CouchDB
provides one with a UUID (Universally Unique Identifier) like
<code>"ead3af23a4459b2d7a1aef05cb0012a9"</code>. It is highly recommended that
an <code>_id</code> is given when adding documents to prevent inadvertent
duplication of records if a bulk loading process is interrupted and restarted.
Therefore <code>isis2json.py</code> provides the <code>-i</code> option, used in
the examples above, to fetch the value of one field in the ISIS input and use it
as the <code>_id</code> attribute.

So, for example, using the <code>-i 2</code> option this (partial) ISIS
structure:

<pre>
   1 «BR1.1»
   2 «538886»
   4 «LILACS»
   4 «LLXPEDT»
</pre>

Can be converted into this JSON document record, ready to import into CouchDB:

<pre>[sourcecode language='javascript']
{
    "_id": "538886",
    "1": [
        [
            ["_", "BR1.1"]
        ]
    ],
    "2": [
        [
            ["_", "538886"]
        ]
    ],
    "4": [
        [
            ["_", "LILACS"]
        ],
        [
            ["_", "LLXPEDT"]
        ]
    ]
}
[/sourcecode]</pre>

Note how the content of the first occurrence of the #2 tag was used as
<code>_id</code>.

<h2><a name="section5"></a>5. Indexing and querying</h2>

<h3>5.1. Building indexes</h3>

For efficient retrieval, records must be indexed. It is useful to distinguish
between a primary index and secondary indexes. A primary index allows retrieval
via the primary key. Every NoSQL database uses a primary index to allow queries
on the primary key. In fact, “key-value” databases often provide only this
means of retrieval. Document databases also support secondary indexes, allowing
fast access via arbitrary attributes of the records. In order to index those
attributes, some form of data extraction must be performed on the records. By
default, JavaScript is the language used for data extraction in CouchDB<sup><a
name="noteorig12" href="#note12">12</a></sup>. In ISIS databases, a simple but
terse “ISIS Formatting Language” is used to generate secondary indexes, also
called inverted files.

Index definition in ISIS is done in a special file called the “Field Select
Table” (FST). Here are just a few lines from the large FST used to generate
indexes for the LILACS database:

<pre>
98 0 v1,"-"v2
10 8 ,mpl,'|AU_|'(v10^*|%|/),
38 0 if s(mpu,v38^a,mpl):'CD' then 'SP_CD-ROM' fi,
</pre>

Without going into all the details, here is what those lines do (the numbering
is not relevant to this discussion<sup><a name="noteorig13"
href="#note13">13</a></sup>):

<ul>
<li>line 98 indexes the contents of tags #1 (cooperating center code) and #2
(record id) concatenated with an hyphen;</li>
<li>line 10 indexes each word of the main subfield of field #10 (author),
concatenated with an <code>'AU_'</code> prefix;</li>
<li>line 38 indexes the string 'SP_CD-ROM' if field 38,
subfield ^a contains the string 'CD';</li> </ul>

To generate an inverted file, an ISIS system applies each line of the FST to
each record in the database, generating one or more index entries (or postings)
per record. Documents can also be indexed incrementally, but BIREME/PAHO/WHO
does a lot of batch processing and usually does “full inversion”, that is,
regenerating inverted files from scratch.

The overall process is very similar in CouchDB. Instead of an FST, CouchDB
allows us to define “views”, which are the result of running JavaScript
functions to generate indexes. The simplest view contains only one function,
the <code>map</code> function, which receives a document as an argument,
extracts some data from it, and calls a special <code>emit</code> function to
add entries to the index.

The <code>emit</code> function takes two arguments: <code>key</code> and
<code>value</code>. The <code>key</code> argument is the one actually indexed;
queries will be made on it. It is usually unstructured, like a string, but
sometimes it is useful to have an array as a key, for instance <code>[country,
province]</code>, to allow multi-level grouping of results. The
<code>value</code> may be any JSON structure, and it is used to display the
results of a query. For example, in an OPAC application we may have an index
where each entry has an ISBN as a key, and the book title and year of
publication as the value.

If you know some SQL, you may find it easier to think in terms of a simple
SELECT statement:

<pre>[sourcecode language='sql']
SELECT title, year FROM books WHERE isbn='9781594201820'
[/sourcecode]</pre>

Here we have a <code>books</code> table with <code>title</code>,
<code>year</code> and <code>isbn</code> fields. To support a query like the one
above we would need an index on <code>isbn</code>. The <code>title</code> and
<code>year</code> fields have no influence on the search, but they are mentioned
in the query because we want to display them in the search result.

In CouchDB terms, to support a similar query we would need a view with a
<code>map</code> function emitting the ISBN as key, and the value would be a
structure like <code>{"title":"Lords of Finance", "year":2009}</code>. The
<code>map</code> function for that view would look like this:

<pre>[sourcecode language='javascript']
function(doc) {
  if (doc.type == "book") {
    emit(doc.isbn, {title: doc.title, year: doc.year});
  }
}
[/sourcecode]</pre>

Because CouchDB has no concept of tables like SQL databases have, it is common
to have different types of records mixed in the same database. Therefore,
<code>map</code> functions often have an <code>if</code> statement to select
which records to index. In the example above, we index only documents of type
<code>"book"</code>. For each book, the key will be the ISBN and the value will
be an object with the <code>title</code> and <code>year</code> attributes.

In addition to a <code>map</code> function, CouchDB views may have a
<code>reduce</code> function, which is used to aggregate results. Views with
<code>reduce</code> functions will be discussed later.

<h3>Indexing ISIS-JSON type 2 records: first approach</h3>

Being able to use a powerful language like JavaScript to create <code>map</code>
functions gives a lot of flexibility when creating views. We can dig as deep as
necessary into the structure of our documents, and massage the data we find in
sophisticated ways. For one report, we created an index of all the LILACS
records containing fields with repeating subfield markers. That would not be
viable in an ISIS FST. However, one important limitation of <code>map</code>
functions is that we have no access to data that is not part of the document
being indexed. It is impossible, for example, to look up some value in another
record, like a join operation in an RDBMS (ISIS also allows this, via the
<code>l/lookup</code> function of the ISIS Formatting Language).

To start with a simple example, let us create a view to index LILACS records by
tag #1 (cooperating center code), the id of the cataloging institution which
created the record. From the LILACS data dictionary we know that tag #1 is
non-repeating and has no subfields, which means all the content of that field
resides in the main subfield (the first one) and only the first occurrence
matters (because there are no repetitions). Here is a view function that would
create an index with field #1 as key and field #12 (title, analytic) as value:

<pre>[sourcecode language='javascript']
function(doc) {
  emit(doc["1"][0][0][1], doc["12"][0][0][1]);
}
[/sourcecode]</pre>

Here we begin to see the price we pay for the generality of the ISIS-JSON type
2 structure. The LILACS field #1 is non-repeating and has no subfields, so it
could be represented like this:

<pre>[sourcecode language='javascript']
{
    "1" : "UY6.1",
    ...
}
[/sourcecode]</pre>

And then we could reach it within a <code>map</code> function with the simple
expression <code>doc["1"]</code>. Even better, we could add an alpha prefix to
the field name (the <code>--prefix</code> option of <code>isis2json.py</code>
does that). Then we could use dot notation (<code>doc.v1</code>) to access tag
v1 in the following:

<pre>[sourcecode language='javascript']
{
    "v1" : "UY6.1",
    ...
}
[/sourcecode]</pre>

However, to deal with ISIS records of any kind in the absence of schema
information, we must assume that every field may have more than one occurrence,
therefore its value cannot be just a string, but an array of occurrences. In
addition, we must assume that every field may have subfields, therefore each
occurrence must be structured as an associative list (ISIS-JSON type 2) or a
dictionary (ISIS-JSON type 3), unless we want to parse the subfields every time
when indexing. In an ISIS-JSON type 2 record, field #1 is represented as:

<pre>[sourcecode language='javascript']
{
    "1": [
        [
            [
                "_",
                "UY6.1"
            ]
        ]
    ],
    ...
}
[/sourcecode]</pre>

And to access its value we must write <code>doc["1"][0][0][1]</code>. Here is
why:

<pre>[sourcecode]
doc   →  {"1": [[["_", "UY6.1"]]]}      # entire document
doc["1"]   →   [[["_", "UY6.1"]]]     # list of occurrences of field #1
doc["1"][0]  →  [["_", "UY6.1"]]    # list of subfields of occurrence 0 of #1
doc["1"][0][0] → ["_", "UY6.1"]   # subfield 0 of occurrence 0 of #1 (key "_")
doc["1"][0][0][1]   →  "UY6.1"  # value of subfield 0 of occurrence 0 of #1
[/sourcecode]</pre>


This quickly becomes tedious, and very burdensome in more complex cases, for
instance, to retrieve a specific subfield aside from the main one we would need
to iterate over the subfield keys. We will soon show a library to make it easier
to handle ISIS-JSON type 2 records, but first let's return to our simple
<code>map</code> function:

<pre>[sourcecode language='javascript']
function(doc) {
  emit(doc["1"][0][0][1], doc["12"][0][0][1]);
}
[/sourcecode]</pre>

This function is part of a view in a "design document" in CouchDB. A design
document is written in JSON, stored in CouchDB as other documents are, but its
identifier starts with <code>_design/</code>. It may contain JavaScript
functions for creating views as well as formatting output, validating document
inserts/updates, etc. Each CouchDB database may have several design documents,
and each design document may have several views.

When trying out CouchDB initially, the easiest way to create a view is by using
the "Temporary view..." option of the view dropdown in the top right area of the
Futon interface. Then you can run the <code>map/reduce</code> functions and
quickly see their results (if your dataset is not too large). To save your work
to a permanent view, you will be prompted to provide a design document name and
a view name.

For any serious work, the CouchApp tool is highly recommended (<a
href="#anderson2009">Anderson, 2009</a>). With it you can develop your design
documents in your local filesystem, using your favorite editor and version
control system, then push your code to a local or remote CouchDB instance with a
command. All of the views for this article were developed in this manner. The
code is in Bitbucket (<a href="#ramalho2011">Ramalho, 2011</a>).

<div style="width: 500px; margin: auto;"><a
href="/media/issue13/ramalho/fig2-tempview.png"><img class="caption"
src="/media/issue13/ramalho/fig2-tempview-500x399.png"
alt="Futon screen shot: testing a temporary view" /></a></div> <p
class="caption"><strong>Figure 2.</strong> Editing and running a temporary view
in CouchDB Futon.</p>

When a view is first visited via HTTP, CouchDB indexes all of the documents by
applying the <code>map</code> function to each of them. CouchDB also
incrementally updates the indexes if documents are inserted or updated. The
indexing is only done on demand, when a view is actually requested, and not when
documents are created or changed.

To install the <code>map</code> function above, we created a design document
called <code>lilacs</code> and within it a view called <code>center</code> (for
"cooperating center"). Here is part of the result of requesting that view with
the cURL utility:

<pre>
$ curl -s http://ramalho.couchone.com/lilcouch/_design/lilacs/_view/center | head -5
{"total_rows":926,"offset":0,"rows":[
{"id":"559682","key":"AR1.1","value":"Hemorragia digestiva baja: [revisi\u00f3n]"},
{"id":"559683","key":"AR1.1","value":"La radiolog\u00eda del colon por enema en el siglo XXI"},
{"id":"559684","key":"AR1.1","value":"El s\u00edndrome an\u00e9mico en las neoplasias del colon derecho"},
{"id":"559685","key":"AR1.1","value":"Met\u00e1stasis cerebrales del carcinoma epidermoide del ano: comunicaci\u00f3n de un caso"},
</pre>

Note that the result comes as a JSON object with three properties:
<code>total_rows</code>, <code>offset</code> and <code>rows</code>, the latter
being an array of <code>map</code> function results. Besides the key and value
generated by the <code>emit</code> function, each result row also has an
<code>id</code> attribute, which carries the <code>_id</code> property of the
corresponding indexed document. By default, the result is sorted in ascending
key order. Here we used the <code>head</code> shell command to crop the
displayed results; we will soon see a way of limiting the results actually sent
by CouchDB.

By the way, the URL shown here is public and you should be able to access it to
try the examples by editing the URL:

<a href="http://ramalho.couchone.com/lilcouch/_design/lilacs/_view/center"
>http://ramalho.couchone.com/lilcouch/_design/lilacs/_view/center</a>

However, please note that the <code>lilcouch</code> database there contains only
a sample of 1000 records, not the full LILACS database.

<h3>5.2. Querying a view</h3>

As we have just seen, CouchDB queries are executed by making HTTP requests on
views. If no arguments are passed, all rows of the result are returned.
However, there are several arguments that can be used to filter the results.
For example, the following query uses the <code>descending</code> and
<code>limit</code> arguments. Note the use of single quotes around the URL,
necessary because "&" is a shell operator and we are using the command line. In
this case the <code>head</code> shell command was not used. The
<code>total_rows</code> property still counts 926, but only the last three rows
are returned because of the <code>limit=3</code> option.

<pre>
$ curl -s 'http://ramalho.couchone.com/lilcouch/_design/lilacs/_view/center?descending=true&limit=3'
{"total_rows":926,"offset":0,"rows":[
{"id":"560003","key":"UY6.1","value":"Vacunaci\u00f3n antigripal en personal de salud del Hospital Pedi\u00e1trico del Centro Hospitalario Pereira Rossell: evoluci\u00f3n de la cobertura del a\u00f1o 2006 al 2008"},
{"id":"560002","key":"UY6.1","value":"Contin\u00faa descendiendo la mortalidadpor asma en Uruguay: per\u00edodo 1984-2008"},
{"id":"560001","key":"UY6.1","value":"Localizaci\u00f3n de lesiones mamarias subcl\u00ednicas con marcador met\u00e1lico (arponaje): an\u00e1lisis de los m\u00e1rgenes quir\u00fargicos"}
]}
</pre>

Another possibility is to filter by key. The next query returns only the
documents created by the cooperating center with code "CO113":

<pre>
$ curl -s 'http://ramalho.couchone.com/lilcouch/_design/lilacs/_view/center?key="CO113"'
{"total_rows":926,"offset":744,"rows":[
{"id":"559986","key":"CO113","value":"De serendipia al tratamiento quir\u00fargico de la diabetes mellitus tipo II"},
{"id":"559987","key":"CO113","value":"La Academia Nacional de Medicina se pronuncia sobre la emergencia social, la medicina y la salud"},
{"id":"559988","key":"CO113","value":"Presentaci\u00f3n inicial de las pacientes con diagn\u00f3stico de c\u00e1ncer de seno en el Centro Javeriano de Oncolog\u00eda, Hospital Universitario San Ignacio"},
{"id":"559989","key":"CO113","value":"\u00bfEs la diabetes mellitus tipo 2 una enfermedad de tratamiento quir\u00fargico?"},
{"id":"559990","key":"CO113","value":"P\u00f3lipos de la ves\u00edcula"},
{"id":"559991","key":"CO113","value":"Ruptura espl\u00e9nica espont\u00e1nea asociada a linfoma perif\u00e9rico de c\u00e9lulas T, presentaci\u00f3n de un caso y revisi\u00f3n de la literatura"},
{"id":"559992","key":"CO113","value":"Diverticulosis del yeyuno: complicaciones y manejo; reporte de caso y revisi\u00f3n de la literatura"},
{"id":"559993","key":"CO113","value":"Actinomicosis abdominal y p\u00e9lvica: reto diagn\u00f3stico y quir\u00fargico para el cirujano general"}
]}
</pre>

It is also possible to limit by starting and ending keys. For example, this
query returns the documents created by centers starting with the "CO" prefix
(for Colombia), up to and including the "CO149" center. Note that the
<code>offset</code> property tells us that 744 rows were skipped to get to the
first one with <code>key="CO"</code>:

<pre>
$ curl -s 'http://ramalho.couchone.com/lilcouch/_design/lilacs/_view/center?startkey="CO"&endkey="CO149"'
{"total_rows":926,"offset":744,"rows":[
{"id":"559986","key":"CO113","value":"De serendipia al tratamiento quir\u00fargico de la diabetes mellitus tipo II"},
{"id":"559987","key":"CO113","value":"La Academia Nacional de Medicina se pronuncia sobre la emergencia social, la medicina y la salud"},
[... several rows omitted ...]
{"id":"560469","key":"CO149","value":"Diagn\u00f3stico sanitario de diversos zoocriaderos helic\u00edcolas en Colombia: determinaci\u00f3n de los principales agentes pat\u00f3genos que afectan el caracol Helix aspersa (O.F. Muller, 1774) en cada etapa de ciclo biol\u00f3gico"},
{"id":"560470","key":"CO149","value":"La salud animal y la globalizaci\u00f3n: el desaf\u00edo de pol\u00edticas sostenibles y equitativas en el contexto de los pa\u00edses en desarrollo"}
]}
</pre>

<h3>5.3. Indexing ISIS-JSON type 2 records: using an API</h3>

Now we will develop a view called <code>au_countries</code> to list the
countries of origin of the authors cited in the sample. This entails fetching
the value of the "^p" subfield of field #10. A first stab at the
<code>map</code> function came out like this:

<pre>[sourcecode language='javascript']
function(doc) {
  var occur, subf;
  if (doc.hasOwnProperty('10')) {
    for (occur=0; occur&lt;doc['10'].length; occur++) {
      for (subf=0; subf&lt;doc['10'][occur].length; subf++) {
        if (doc['10'][occur][subf][0] === 'p') {
          emit(doc['10'][occur][subf][1], 1);
          break;
        }
      }
    }
  } else {
    emit(null, 1);
  }
}
[/sourcecode]</pre>

Yes, that is quite ugly. Much of the ugliness comes from the fact that
ISIS-JSON type 2 arranges subfields in a associative list, where a key can only
be found through a linear search.

So here is the next solution, much simpler thanks to the use of a custom
library:

<pre>[sourcecode language='javascript']
function(doc) {
  // !code vendor/isisdm_t2/_attachments/isisdm.js
  var i, occurs = ISIS.getallsub(doc, 10, "p");
  for (i=0; i&lt;occurs.length; i++) {
    if (occurs[i] !== undefined) {
      emit(occurs[i], 1);
    }
  }
}
[/sourcecode]</pre>

The first line of the function is a special <code>!code</code> comment that
works as an include, and is processed by the CouchApp utility. This is actually
a workaround because at this time CouchDB does not support the use of the
CommonJS <code>require</code> function to import modules in <code>map</code> or
<code>reduce</code> functions (<code>require</code> is supported elsewhere in
CouchDB, in show and list functions for example). The <code>isisdm.js</code>
module used here is part of the ISIS Data Model APIs we are developing at
BIREME/PAHO/WHO to handle ISIS records in JavaScript and Python. Currently
<code>isisdm.js</code> defines four main functions:

<dl>
<dt><code>get(record, tag, missing)</code></dt>
    <dd>returns first occurrence of tag in record, otherwise returns missing
    value;</dd>
<dt><code>getall(record, tag)</code></dt>
    <dd>returns all occurrences of tag in record or an empty list if the tag is
    not found;</dd>
<dt><code>getsub(record, tag, key, missing)</code></dt>
    <dd>returns subfield identified by key in the first occurrence of tag in
    record, otherwise returns missing value;</dd>
<dt><code>getallsub(record, tag, key, missing)</code></dt>
    <dd>returns an array with contents of subfield identified by key in all
    occurrences of tag in record;</dd>
</dl>

We are using QUnit (<a href="#jquery2011">JQuery, 2011</a>) to perform unit
tests on the <code>isisdm.js</code> module. Here is the the test page:

<div style="width: 500px; margin: auto;"><a
href="/media/issue13/ramalho/fig3-qunit.png"><img class="caption"
src="/media/issue13/ramalho/fig3-qunit-500x394.png"
alt="Futon screen shot: viewing a record" /></a></div> <p
class="caption"><strong>Figure 3.</strong> QUnit test results page, showing that
all tests are passing. The tests for the <code>getallsub</code> function are
expanded. This page is live at <a
href="http://ramalho.couchone.com/lilcouch/_design/lilacs/tests.html"
>http://ramalho.couchone.com/lilcouch/_design/lilacs/tests.html</a></p>

Now, back to the <code>map</code> function, note that the call to
<code>emit</code> uses the subfield occurrence as key and the value is just a
number 1: <code>emit(occurs[i], 1);</code>. This is because the intent of this
view is to produce an aggregate count of each different key. To achieve this, we
need a <code>reduce</code> function to sum the values emitted, like this:

<pre>[sourcecode language='javascript']
function(keys, values, rereduce) {
  return sum(values);
}
[/sourcecode]</pre>

Now that our <code>au_countries</code> view has both a <code>map</code> and
<code>reduce</code> function, we can query it:

<pre>
$ curl -s 'http://ramalho.couchone.com/lilcouch/_design/lilacs/_view/au_countries'
{"rows":[
{"key":null,"value":2368}
]}
</pre>

The simplest, no-argument request returns null as the key and 2368 as the value.
This is the result of the <code>reduce</code> function: 2368 is the sum of all
the 1's emitted by the <code>map</code> function. In other words, there are 2368
occurrences of the
#10 tag with a <code>^p</code> subfield.

The <code>reduce</code> function can be disabled if the
<code>reduce=false</code> option is given in the request. Below are the first 5
and the last 5 lines of querying the <code>au_countries</code> view without
applying the <code>reduce</code>. What you see is just the result of the
<code>map</code> function:

<pre>
$ curl -s 'http://ramalho.couchone.com/lilcouch/_design/lilacs/_view/au_countries?reduce=false' | head -5
{"total_rows":2368,"offset":0,"rows":[
{"id":"560162","key":"\u00c1frica do Sul","value":1},
{"id":"560162","key":"\u00c1frica do Sul","value":1},
{"id":"560013","key":"Alemania","value":1},
{"id":"559596","key":"Algeria","value":1},
$ curl -s 'http://ramalho.couchone.com/lilcouch/_design/lilacs/_view/au_countries?reduce=false' | tail -5
{"id":"560034","key":"USA","value":1},
{"id":"560325","key":"USA","value":1},
{"id":"559604","key":"Venezuela","value":1},
{"id":"559957","key":"Venezuela","value":1}
]}
</pre>

A more interesting result can be obtained if we pass the <code>group=true</code>
option:

<pre>$ curl -s 'http://ramalho.couchone.com/lilcouch/_design/lilacs/_view/au_countries?group=true&limit=10'
{"rows":[
{"key":"\u00c1frica do Sul","value":2},
{"key":"Alemania","value":1},
{"key":"Algeria","value":1},
{"key":"Argentina","value":289},
{"key":"Austr\u00e1lia","value":2},
{"key":"Bolivia","value":1},
{"key":"Brasi\u00e7","value":1},
{"key":"Brasil","value":1185},
{"key":"Brazil","value":26},
{"key":"Brsil","value":1}
]}
</pre>

The result set above, limited to 10 rows, shows the number of occurrences of
each key. Obviously there are four entries for Brazil, with alternate and
incorrect spellings. This would have to be dealt with elsewhere. The key point
here is that the <code>reduce</code> function and the group option work together
to produce aggregate results, similar to the ones we can produce with the SQL
GROUP BY clause in a relational DBMS.

<h2><a name="section6"></a>6. Results and Conclusion</h2>

<h3>6.1. Results</h3>

We have identified in CouchDB and MongoDB two modern, Open Source database
systems which are suitable for semistructured records like those defined by the
ISO-2709 standard and the ISIS family of systems, serving the needs of MARC and
LILACS datasets.

Furthermore, we created a tool to convert ISIS records from the ISO-2709 format
to JSON documents suitable for loading into CouchDB (or MongoDB, though we have
not shown that in this paper).

We considered a number of alternative representations for ISIS data in JSON
format. In this paper we used the type 2 representation which, though
somewhat awkward to work with, preserves subfield ordering and allows for
repeated subfields, a feature of MARC records. ISIS fields do not have
indicators, so we have not discussed how to represent them in JSON. One
possibility would be to use special subfield markers, like "_1" and "_2",
attached to field occurrences where the indicators are present.

Finally, we developed a JavaScript library to make it easy to handle ISIS-JSON
type 2 records when creating CouchDB views.

<h3>6.2. Conclusion</h3>

These experiments and developments have shown that it is easy to import
ISO-2709 data into a document database like CouchDB or MongoDB. After doing
so, it becomes almost trivial to create Web services to publish the data, using
just JavaScript, particularly in CouchDB thanks to its native support for JSON
over HTTP.

While the semistructured data model was only formalized in the mid 1990's, the
ISO-2709 and ISIS record formats have always been concrete, albeit limited,
examples of it. Research into that model includes results such as algorithms to
extract a formal schema from actual datasets, methods for dealing with shared or
duplicate data, and a normal form adapted to semistructured schemas (<a
href="tok2005">Tok, 2005</a>). We have much to learn and apply from
semistructured data research into our daily work with bibliographic records.

At BIREME/PAHO/WHO we continue investigating the challenges and opportunities
of converting from the ISIS legacy systems to modern Open Source document
databases. Meanwhile, we are also developing new applications – not limited to
the ISIS data model and legacy data – using CouchDB, JavaScript, Python and the
Pyramid framework. Pyramid is a new web framework resulting from the merger of
the Pylons and Repoze projects, and it includes Deform, a form generation and
validation library with strong support for repeating fields and nested forms,
important features when building interfaces for semistructured data entry (<a
href="#pylons2011">Pylons, 2011</a>). These new developments allow us to think
about how we want the LILACS bibliographic database to operate in the year 2015,
when its 30th anniversary will be celebrated.

<h2>Notes</h2>

<a name="note1" href="#noteorig1"><sup>1</sup></a> The qualification
“<em>flat</em> relational model” is used here because there is actually a
controversy about the meaning of the First Normal Form. C. J. Date states that
the atomicity requirement of the 1NF is pointless, and that since “types can be
anything” it follows that “all relations are in first normal form by definition”
(<a href="#date2005">Date, 2005</a>, p. 37). But E. F. Codd did define
normalization as the process of removing non-simple domains from relations (<a
href="#codd1970">Codd, 1970</a>), and current database textbooks define the 1NF
as Codd did (<a href="#elmasri2006">Elmasri, 2006</a>; <a
href="#silberschatz2006">Silberschatz, 2006</a>). Leaving aside the theory,
although some RDBMS – like PostgreSQL and Oracle – support array values and even
user-defined composite field values, modern database-independent access methods,
such as the Ruby on Rails ActiveRecord ORM, often target the lowest common
denominator, and that is the flat relational model.

<a name="note2" href="#noteorig2"><sup>2</sup></a> “The structure, or 'empty
container', the content designators (tags, indicators, and subfield codes) used
to explicitly identify or additionally characterize the data elements, and the
content, the data itself (author's name, titles, etc.) are the three components
of the [MARC II] format.” (<a href="#avram1975">Avram, 1975</a>)

<a name="note3" href="#noteorig3"><sup>3</sup></a> The term “semistructured”
often appears as “semi-structured” in the literature. Searching for both
spellings is recommended because Google returns more results for the hyphenated
spelling, but some key proponents of the model use just one word (<a
href="#abiteboul1999">Abiteboul, 1999</a>; <a href="#tok2005">Tok, 2005</a>; <a
href="#suciu2001">Suciu, 2001</a>). Here I use one word, but kept the source
spelling in quotations.

<a name="note4" href="#noteorig4"><sup>4</sup></a> JSON is JavaScript Object
Notation, a data exchange format described as a light-weight alternative do XML
(<a href="#crockford2006b">Crockford, 2006b</a>). Since 2006 JSON is an Internet
standard, formalized by RFC4627 (<a href="#crockford2006a">Crockford,
2006a</a>).

<a name="note5" href="#noteorig5"><sup>5</sup></a> In MARC records, indicators
are two single-digit positions located between the tag number and the field
content, which have different uses depending on the tag number.

<a name="note6" href="#noteorig6"><sup>6</sup></a> ISIS datasets sometimes
contain repeated subfields due to typos and processing errors. However, the ISIS
Formatting Language, used to generate indexes and displays, is only capable of
extracting the first occurrence of a subfield.

<a name="note7" href="#noteorig7"><sup>7</sup></a> From the XML standard (<a
href="#w3c1998">W3C, 1998</a>): “An element type has mixed content when elements
of that type may contain character data, optionally interspersed with child
elements.” Mixed content also exists in HTML and in SGML, their common ancestor.

<a name="note8" href="#noteorig8"><sup>8</sup></a> Here we mean “durability” as
in the “D” in the ACID database properties, meaning that once an insert or
update is committed, it is written to disk. For a discussion on why the
designers of MongoDB initially compromised on single-server durability, see
“What about Durability?“ (<a href="#mongodb2010">MongoDB, 2010</a>). Update: as
of March 16, 2011, MongoDB 1.8 is released and includes write-ahead journaling.
“With journaling enabled, crash recovery is fast and safe” (<a
href="#mongodb2011">MongoDB, 2011</a>).

<a name="note9" href="#noteorig9"><sup>9</sup></a> “An object is an unordered
collection of zero or more name/value pairs” (<a
href="#crockford2006a">Crockford, 2006a</a>).

<a name="note10" href="#noteorig10"><sup>10</sup></a> Association lists should
not to be confused with associative arrays, such as those in PHP. A PHP
associative array is like a hash in Perl or Ruby, a Python dictionary or a
JavaScript or JSON object. In contrast, an association list, or <em>alist</em>,
is not a primitive type in those languages, but can be built as an array or list
of key-value pairs, where each pair is also an array or list. In Python, a more
readable and convenient representation would be a list of tuples: <code>[(key1,
value1), (key2, value2), ...]</code>. Association lists originated in Lisp and
are common in Scheme (<a href="#mit2010">MIT, 2010</a>).

<a name="note11" href="#noteorig11"><sup>11</sup></a> As far as we know, apart
from HTTP, the only other way to interact with CouchDB is via Hovercraft, an
Erlang library which provides direct access to the database.

<a name="note12" href="#noteorig12"><sup>12</sup></a> CouchDB indexes are
generated by “view servers” which can be written in any language. An Erlang view
server is bundled with recent versions of CouchDB, but is not enabled by
default. Python can be easily configured for that purpose, and Java is also
known to be used. A list of view server implementations can be found at <a
href="http://wiki.apache.org/couchdb/View_server"
>http://wiki.apache.org/couchdb/View_server</a>.

<a name="note13" href="#noteorig13"><sup>13</sup></a> The numbers in the first
column (98, 10, 38 in the example) identify a specific index, usually coinciding
with the indexed tag number, but not necessarily. Some versions of ISIS support
advanced querying using those numbers, others do not.

<h2>References</h2>

<a name="abiteboul1999"></a>ABITEBOUL, Serge; BUNEMAN, Peter; SUCIU, Dan. Data
on the Web: From Relations to Semistructured Data and XML. San Francisco: Morgan
Kaufmann, 1999.

<a name="anderson2009"></a>ANDERSON, C.; CHESNEAU, B. CouchApp: Standalone
CouchDB Application Development Made Simple, 2009 [cited 2011 Mar. 26].
Available from: <a href="https://github.com/couchapp/couchapp/"
>https://github.com/couchapp/couchapp/</a>.

<a name="apache2011a"></a>APACHE FOUNDATION. The Apache Cassandra Project [cited
2011 Mar. 26]. Available from: <a
href="http://cassandra.apache.org/">http://cassandra.apache.org/</a>.

<a name="apache2011b"></a>APACHE FOUNDATION. The Apache CouchDB Project [cited
2011 Mar. 26]. Available from: <a
href="http://couchdb.apache.org/">http://couchdb.apache.org/</a>.

<a name="avram1975"></a>AVRAM, Henriette D. MARC: Its History and Implications.
Washington, DC: U.S. Government Printing Office, 1975.

<a name="barbieri2011"></a>BARBIERI, Heitor. Bruma (Java library source code).
São Paulo: BIREME/OPAS/OMS, 2010 [cited 2011 Mar. 26]. Available from: <a
href="http://reddes.bvsalud.org/projects/isisnbp/browser/Bruma"
>http://reddes.bvsalud.org/projects/isisnbp/browser/Bruma</a>.

<a name="bireme2008"></a>BIREME/OPAS/OMS. Diccionario de Datos del Modelo LILACS
Versión 1.6a. São Paulo: BIREME/OPAS/OMS, 2008 [cited 2011 Mar. 26]. Available
from: <a
href="http://bvsmodelo.bvs.br/download/lilacs/LILACS-5-DicionarioDados-es.pdf"
>http://bvsmodelo.bvs.br/download/lilacs/LILACS-5-DicionarioDados-es.pdf</a>

<a name="bireme2010"></a>BIREME/OPAS/OMS. ISIS-JSON types. São Paulo:
BIREME/OPAS/OMS, 2010 [cited 2011 Mar. 21]. Available from: <a
href="http://reddes.bvsalud.org/projects/isisnbp/wiki/ISIS-JSON_types"
>http://reddes.bvsalud.org/projects/isisnbp/wiki/ISIS-JSON_types</a>.

<a name="chang2006"></a>CHANG, Fay; DEAN, Jeffrey; GHEMEWAT, Sanjay; et al.
Bigtable: A Distributed Storage System for Structured Data. In: OSDI'06: Seventh
Symposium on Operating System Design and Implementation, Seattle, WA, 2006
[cited 2011 Mar. 21]. Available from: <a
href="http://labs.google.com/papers/bigtable.html"
>http://labs.google.com/papers/bigtable.html</a>.

<a name="codd1970"></a>CODD, E. F. A Relational Model of Data for Large Shared
Data Banks. Communications of the ACM, v. 13, n. 6, p. 377-387, jun. 1970 [cited
2011 Mar. 26]. Available from: <a
href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.98.5286&rep=rep1&type=pdf"
>http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.98.5286&rep=rep1&type=pdf</a>.

<a name="crockford2006a"></a>CROCKFORD, D. RFC4627: The application/json Media
Type for JavaScript Object Notation (JSON). The Internet Society, 2006 [cited
2011 Mar. 21]. Available from: <a href="http://tools.ietf.org/html/rfc4627"
>http://tools.ietf.org/html/rfc4627</a>.

<a name="crockford2006b"></a>CROCKFORD, D. JSON: The Fat-Free Alternative to XML.
JSON.ORG, 2006 [cited 2011 Mar. 21]. Available from: <a
src="http://www.json.org/fatfree.html"> http://www.json.org/fatfree.html</a>.

<a name="date2005"></a>DATE, C. J. Database in Depth: Relational Theory for
Practitioners. Sebastopol: O'Reilly Media, 2005.

<a name="decandia2007"></a>DECANDIA, Giuseppe; HASTORUN, Deniz; JAMPANI, Madan;
et al. Dynamo: Amazon's Highly Available Key-Value Store. In: Proceedings of the
21st ACM Symposium on Operating Systems Principles, Stevenson, WA, 2007.

<a name="dirolf2010"></a>DIROLF, M. Bin­ary JSON. BSONSPEC.ORG, 2010 [cited 2011
Mar. 21]. Available from: <a src="http://bsonspec.org/#/specification"
>http://bsonspec.org/#/specification</a>.

<a name="elmasri2006"></a>ELMASRI, R; NAVATHE, S. B. Fundamentals of Database
Systems. 5<sup>th</sup> ed. Reading, MA: Addison-Wesley, 2006.

<a name="hopkinson2005"></a>HOPKINSON, Alan. CDS/ISIS: The second decade.
Information Development. February 2005 vol. 21 no. 1 p.31-37 [cited 2011 Mar.
26]. Available from: <a
href="http://eprints.mdx.ac.uk/2700/3/isisseconddecade.pdf"
>http://eprints.mdx.ac.uk/2700/3/isisseconddecade.pdf</a>.

<a name="jquery2011"></a>JQUERY PROJECT. QUnit, 2011. [cited 2011 Mar. 21].
Available from: <a src="http://docs.jquery.com/Qunit"
>http://docs.jquery.com/Qunit</a>

<a name="lopes2010"></a>LOPES, Francisco. Historia_ISIS.doc (internal memo). São
Paulo: BIREME/OPAS/OMS, 2010.

<a name="merriman2010"></a>MERRIMAN, Dwight. Comparing Mongo DB and Couch DB.
MongoDB.org. [cited 2011 Mar. 21]. Available from: <a
src="http://www.mongodb.org/display/DOCS/Comparing+Mongo+DB+and+Couch+DB"
>http://www.mongodb.org/display/DOCS/Comparing+Mongo+DB+and+Couch+DB">.

<a name="mit2010"></a>MIT. MIT/GNU Scheme - Association Lists. Cambridge, MA:
Massachusetts Institute of Technology, 2010 [cited 2011 Mar. 22]. Available
from: <a
href="http://web.mit.edu/scheme_v9.0.1/doc/mit-scheme-ref/Association-Lists.html"
>http://web.mit.edu/scheme_v9.0.1/doc/mit-scheme-ref/Association-Lists.html</a>.

<a name="mongodb2010"></a>MONGODB.ORG. What about Durability? MongoDB Blog, 2010
[cited 2011 Mar. 22]. Available from: <a
href="http://blog.mongodb.org/post/381927266/what-about-durability"
>http://blog.mongodb.org/post/381927266/what-about-durability</a>.

<a name="mongodb2011"></a>MONGODB.ORG. MongoDB 1.8 Released. MongoDB Blog, 2011
[cited 2011 Mar. 22]. Available from: <a
href="http://blog.mongodb.org/post/3903149313/mongodb-1-8-released"
>http://blog.mongodb.org/post/3903149313/mongodb-1-8-released</a>.

<a name="pylons2011"></a>PYLONS PROJECT. Deform. Agendaless Consulting, 2011
[cited 2011 Mar. 22]. Available from: <a
"http://docs.pylonsproject.org/projects/deform/dev/#"
>http://docs.pylonsproject.org/projects/deform/dev/</a>.

<a name="ramalho2011"></a>RAMALHO, L. LILACS on CouchDB, 2011 [cited 2011 Mar.
26]. Available from: <a href="https://bitbucket.org/ramalho/lilcouch"
>https://bitbucket.org/ramalho/lilcouch</a>

<a name="silberschatz2006"></a>SILBERSCHATZ, A.; KORTH, H.; Sudarshan, S.
Database System Concepts, 5<sup>th</sup> Ed. New York: McGraw-Hill, 2006.

<a name="suciu2001"></a>SUCIU, Dan. Managing XML and Semistructured Data -
Lecture Series (digital slides). Seattle: Department of Computer Science &
Engineering, University of Washington, 2001 [cited 2011 Mar. 21]. Available
from: <a href="http://www.cs.washington.edu/homes/suciu/COURSES/590DS/"
>http://www.cs.washington.edu/homes/suciu/COURSES/590DS/</a>

<a name="suciu2009"></a>SUCIU, Dan. Semi-Structured Data Model. In: LIU, L.;
ÖZSU, M. T. Encyclopedia of Database Systems: Springer, 2009. p. 2601-2605.

<a name="thomale2010"></a>THOMALE, Jason. Interpreting MARC: Where's the
Bibliographic Data? Code4Lib Journal, Issue 11, 2010-09-21 [cited 2011 Mar. 26].
Available from: <a href="http://journal.code4lib.org/articles/3832"
>http://journal.code4lib.org/articles/3832</a>.

<a name="tok2005"></a>TOK, Wang Ling; LEE, Mong Li; DOBBIE, Gillian.
Semistructured Database Design. Boston: Springer Science, 2005.

<a name="w3c1998"></a>W3C. Extensible Markup Language (XML) 1.0 (Fifth Edition),
2008. [cited 2011 Mar. 26]. Available from: <a
href="http://www.w3.org/TR/REC-xml/#sec-mixed-content"
>http://www.w3.org/TR/REC-xml/#sec-mixed-content</a>