Gabriel Farrell committed eb2898f

Last big round of copy edits on first draft

Files changed (1)

code4lib13/isis2couchdb.txt

 
 Except for the repetition of the "email" key, it's very similar to JSON, or
 JavaScript Object Notation (Crockford 2006). The presence of keys describing
-fields like "name," "first," etc., and the storage of those keys alongside the
+fields like "name", "first", etc., and the storage of those keys alongside the
 data, is what the encyclopedia entry means by coexisting data values and schema
 components (Liu, 2009). It is a characteristic shared by ISO-2709, XML, and
 JSON (with the limitation that ISO-2709 fields are identified by tags composed
 as deep as graph databases (designed to allow general queries over paths of
 nested objects). What they offer is somewhere in between: JSON-like records
 allowing nested structures, and expressive query languages to index and retrieve
-those records. Those rich records are called “documents.” CouchDB and MongoDB
-call themselves “document databases.”
+those records. Those rich records are called “documents”. CouchDB and MongoDB
+call themselves “document databases”.
 
 In the case of CouchDB, the document format is JSON. MongoDB uses BSON, a binary
 format inspired by JSON but offering more data types, such as int32, datetime,
 appeared before all occurrences of field #10, while in the JSON representation
 the key “10” precedes key “2”. This is irrelevant in practice. But, crucially,
 the order of the authors is the same in both formats: first “Kanda, Paulo
-Afonso,” then “Smidth, Magali Taino.
+Afonso”, then “Smidth, Magali Taino”.
 
 Regarding subfield ordering within a field, the LILACS data dictionary does
 establish a canonical ordering. For example, in the case of field #10 the order
 </pre>
 
 If an _id attribute is not present in an inserted document, CouchDB provides
-one, filled with a UUID (Universally Unique Identifier) like
+one with a UUID (Universally Unique Identifier) like
 "ead3af23a4459b2d7a1aef05cb0012a9". It is highly recommended that an _id is
 given when adding documents to prevent inadvertent duplication of records if a
 bulk loading process is interrupted and restarted. Therefore isis2json.py
 provides the -i option, used in the examples above, to fetch the value of one
 field in the ISIS input and use it as the _id attribute.
 
-So for example using the "-i 2" option this (partial) ISIS structure:
+So, for example, using the "-i 2" option this (partial) ISIS structure:
 
 <pre>
    1 «BR1.1»
 <pre>[sourcecode language='javascript']
 {
     "_id": "538886",
-   "1": [
-       [
-           ["_", "BR1.1"]
-       ]
-   ],
-   "2": [
-       [
-           ["_", "538886"]
-       ]
-   ],
-   "4": [
-       [
-           ["_", "LILACS"]
-       ],
-       [
-           ["_", "LLXPEDT"]
-       ]
-   ]
+    "1": [
+        [
+            ["_", "BR1.1"]
+        ]
+    ],
+    "2": [
+        [
+            ["_", "538886"]
+        ]
+    ],
+    "4": [
+        [
+            ["_", "LILACS"]
+        ],
+        [
+            ["_", "LLXPEDT"]
+        ]
+    ]
 }
 [/sourcecode]</pre>
 
 
 The overall process is very similar in CouchDB. Instead of an FST, CouchDB
 allows us to define "views", which are the result of running JavaScript
-functions to generate indexes (other languages may be used, as noted before).
-The simplest view contains only one function, called "map", which receives a
-document as an argument, and may call a special emit function to add entries to
-the index.
+functions to generate indexes (other languages may also be used, but JavaScript
+is the default). The simplest view contains only one function, called "map",
+which receives a document as an argument, and may call a special emit function
+to add entries to the index.
 
 The emit function takes two arguments: key and value. The key argument is the
 one actually indexed; queries will be made on it. It is usually unstructured,
 Here we have a books table with title, year and isbn fields. To support a query
 like the one above we would need an index on isbn. The title and year fields
 have no influence on the search, but they are mentioned in the query because we
-want to display them on the search result.
+want to display them in the search result.
 
 In CouchDB terms, to support a similar query we would need a view with a map
 function emitting the ISBN as key, and the value would be a structure like
 }
 [/sourcecode]</pre>
 
-Because CouchDB has no concept of tables like SQL databases have, it is common to
-have different types of records mixed in the same database. Therefore, map
+Because CouchDB has no concept of tables like SQL databases have, it is common
+to have different types of records mixed in the same database. Therefore, map
 functions often have an if clause to select which records to index. In the
 example above, we index only documents of type "book". For each book, the key
-will be the ISBN and the value will be a object with the title and year
+will be the ISBN and the value will be an object with the title and year
 attributes.
 
-Besides a map function, CouchDB views may also have a reduce function, which is
-used to aggregate results. Views with reduce functions will be discussed later.
+In addition to a map function, CouchDB views may have a reduce function, which
+is used to aggregate results. Views with reduce functions will be discussed
+later.
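
As a rough illustration of the map function with an if clause described above, the following sketch can be run outside CouchDB. The emit stand-in, the sample documents and their field values are assumptions made for this example; inside CouchDB, emit is supplied by the view server.

```javascript
// Minimal stand-in for CouchDB's emit(): it only collects the emitted rows.
const rows = [];
function emit(key, value) {
  rows.push({ key: key, value: value });
}

// Map function in the style described above: index only "book" documents,
// using the ISBN as key and an object with title and year as value.
function map(doc) {
  if (doc.type === "book") {
    emit(doc.isbn, { title: doc.title, year: doc.year });
  }
}

// Invented sample documents of two different types in the same database.
const docs = [
  { _id: "a", type: "book", isbn: "9780000000002", title: "Sample Title", year: 2010 },
  { _id: "b", type: "author", name: "Sample Author" }
];
docs.forEach(map);

console.log(JSON.stringify(rows));
```

Only the document of type "book" produces an index entry; the other document is silently skipped.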
 
 <h3>Indexing ISIS-JSON type 2 records: first approach</h3>
 
 Being able to use a powerful language like JavaScript to create map functions
 gives a lot of flexibility when creating views. We can dig as deep as necessary
 into the structure of our documents, and massage the data we find in
-sofisticated ways. For one report, we created an index of all the LILACS records
-containing fields with repeating subfield markers. That would be impossible to
-do in an ISIS FST.
+sophisticated ways. For one report, we created an index of all the LILACS
+records containing fields with repeating subfield markers. That would be
+impossible to do in an ISIS FST.
 
 To start with a simple example, let us create a view to index LILACS records by
-by tag #1 (cooperating center code), the id of the cataloguing institution which
+tag #1 (cooperating center code), the id of the cataloguing institution which
 created the record. From the LILACS data dictionary we know that tag #1 is
 non-repeating and has no subfields, which means all the content of that field
 resides in the main subfield (the first one) and only the first occurrence
 }
 [/sourcecode]</pre>
 
-Now we begin to see the price to pay for the generality of the ISIS-JSON type 2
-structure. The LILACS field #1 is non-repeating and has no subfields, so it
+Here we begin to see the price we pay for the generality of the ISIS-JSON type
+2 structure. The LILACS field #1 is non-repeating and has no subfields, so it
 could be represented like this:
 
 <pre>[sourcecode language='javascript']
 [/sourcecode]</pre>
 
 And then we could reach it within a map function with the simple expression
-<pre>doc["1"]</pre>. Even better, we could add an alpha prefix to the field name
-(the --prefix option of isis2json.py does that). Then we could use dot notation,
-that is, <pre>doc.v1</pre> to access tag v1 in:
+<code>doc["1"]</code>. Even better, we could add an alpha prefix to the field
+name (the --prefix option of isis2json.py does that). Then we could use dot
+notation (<code>doc.v1</code>) to access tag v1 in the following:
 
 <pre>[sourcecode language='javascript']
 { 
 }
 [/sourcecode]</pre>
 
-However, to be able to deal with ISIS records of any kind in the absence of
-schema information, we must assume that every field may have more than one
-occurrence, therefore its value cannot be just a string, but an array of
-occurrences. In addition, we must assume that every field may have subfields,
-therefore each occurrence must be structured into an associative list (ISIS-JSON
-type 2) or a dictionary (ISIS-JSON type 3), unless we want to parse the
-subfields every time when indexing. Therefore field #1 is represented as:
+However, to deal with ISIS records of any kind in the absence of schema
+information, we must assume that every field may have more than one occurrence,
+so its value cannot be just a string but must be an array of occurrences. In
+addition, we must assume that every field may have subfields, so each
+occurrence must be structured as an association list (ISIS-JSON type 2) or a
+dictionary (ISIS-JSON type 3), unless we want to parse the subfields every time
+we index. Therefore field #1 is represented as:
 
 <pre>[sourcecode language='javascript']
 {  
 }
 [/sourcecode]</pre>
 
-And to access its value we must write <pre>doc["1"][0][0][1]</pre>. This quickly
-becomes tedious, and very burdensome in more complex cases, for instance when
-retrieving a specific subfield, and not the main one. We will soon show a
-library to make it easier to handle ISIS-JSON type 2 records.
-
-Meanwhile, going back to the simple map function defined above, here it is again:
+And to access its value we must write <code>doc["1"][0][0][1]</code>. This
+quickly becomes tedious, and very burdensome in more complex cases, for
+instance, when retrieving a specific subfield aside from the main one. We will
+soon show a library to make it easier to handle ISIS-JSON type 2 records, but
+first let's return to our simple map function:
 
 <pre>[sourcecode language='javascript']
 function(doc) {
 }
 [/sourcecode]</pre>
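
To make the access path doc["1"][0][0][1] concrete, here is a small sketch that unpacks it one level at a time. The record reuses the _id and field #1 values from the example shown earlier; the intermediate variable names are ours, chosen only for illustration.

```javascript
// ISIS-JSON type 2: field -> array of occurrences -> array of
// [subfield key, value] pairs.
const doc = {
  _id: "538886",
  "1": [
    [
      ["_", "BR1.1"]
    ]
  ]
};

const occurrences = doc["1"];            // every occurrence of field #1
const firstOccurrence = occurrences[0];  // association list of subfields
const mainSubfield = firstOccurrence[0]; // the pair ["_", "BR1.1"]
const value = mainSubfield[1];           // the content itself

// The step-by-step path and the one-line expression reach the same value.
console.log(value === doc["1"][0][0][1]);
```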
 
-In order to make it work, you must put that function within a view, and that
-goes into a special "design document" in CouchDB. These are also JSON documents,
-but with a particular structure, and they may contain JavaScript functions for
-views and for other purposes such as formatting output, validating document
+This function is part of a view in a "design document" in CouchDB. A design
+document is written in JSON, stored in CouchDB as other documents are, but its
+identifier starts with "_design/". It may contain JavaScript functions for
+creating views as well as formatting output, validating document
 inserts/updates, etc. Each CouchDB database may have several design documents,
 and each design document may have several views.
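
A design document holding one view might look roughly like the object built below. The names "lilacs" and "center" match the design document and view used later in this article; the body of the map function is a hypothetical placeholder, not the function used by the authors. Views are stored under the "views" property, with each function kept as a string.

```javascript
// Hypothetical map function; it is never called here, only serialized.
const mapCenter = function (doc) {
  if (doc["1"]) {
    emit(doc["1"][0][0][1], null);
  }
};

// Design documents are ordinary JSON documents whose _id starts with
// "_design/". Functions are stored as source text, not as live code.
const designDoc = {
  _id: "_design/lilacs",
  language: "javascript",
  views: {
    center: {
      map: mapCenter.toString()
    }
  }
};

console.log(JSON.stringify(designDoc, null, 2));
```

Tools such as couchapp generate a document of this shape from files in the local filesystem, so it rarely needs to be written by hand.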
 
 When trying out CouchDB initially, the easiest way to create a view is by using
 the "Temporary view..." option of the view dropdown in the top right area of the
 Futon interface. Then you can run the map/reduce functions and quickly see their
-results (if your dataset is not too large). To save your work in to a permanent
+results (if your dataset is not too large). To save your work to a permanent
 view, you will be prompted to provide a design document name and a view name.
 
 For any serious work, the couchapp tool is highly recommended [XXX link]. It
 allows you to develop your design documents in your local filesystem, using your
-favorite editor and version control system, and then you may push your code to a
-local or remote CouchDB instance with a simple command. All the views for this
-article were developed like this. The code is in Bitbucket [XXX link].
+favorite editor and version control system, then push your code to a
+local or remote CouchDB instance with a command. All of the views for this
+article were developed in this manner. The code is in Bitbucket [XXX link].
 
 [temp-view.png screenshot]
 
-When a view is first visited via HTTP, CouchDB indexes all the documents by
+When a view is first visited via HTTP, CouchDB indexes all of the documents by
 applying the map function to each of them. CouchDB also incrementally updates
-the indexes if documents are inserted or updated. The indexing is only done by
-demand, when a view is actually requested, and not when documents are created or
-changed.
+the indexes if documents are inserted or updated. The indexing is only done on
+demand, when a view is actually requested, and not when documents are created
+or changed.
 
 To install the map function above, we created a design document called "lilacs"
 and within it a view called "center" (for "cooperating center"). Here is part of
-the result of acessing that view with the curl utility:
+the result of requesting that view with the curl utility:
 
 <pre>
 $ curl -s http://ramalho.couchone.com/lilcouch/_design/lilacs/_view/center | head -5
 Note that the result comes as a JSON object with three properties: total_rows,
 offset and rows, the latter being an array of map function results. Besides the
 key and value generated by the emit function, each result also has an id
-attribute, which carries _id property of the corresponding indexed document. By
-default, the result is sorted in ascending key order. Here we used the head
-shell command to crop the displayed results; we will soon see a way of limiting
-the results actually sent by CouchDB.
+attribute, which carries the _id property of the corresponding indexed
+document. By default, the result is sorted in ascending key order. Here we used
+the head shell command to crop the displayed results; we will soon see a way of
+limiting the results actually sent by CouchDB.
 
-By the way, the URL shown above is public, you should be able to access it.
-However, the lilcouch/ database there contains only a sample of 1000 records,
+By the way, the URL shown above is public and you should be able to access it.
+However, the "lilcouch" database there contains only a sample of 1000 records,
 not the full LILACS database.
 
 <h3>Querying a view</h3>
 
 As we have just seen, CouchDB queries are executed by making HTTP requests on
-views. If no arguments are passed, all rows of the result are returned. However,
-there are several arguments that can be used to filter the results. For example,
-the following query uses the descending and limit arguments. Note the use of
-single quotes around the URL. This is necessary because & is a shell operator.
-In this case the head shell command was not used so what you see is the entire
-result set. The total_rows property still counts 926, but only the last three
-rows where returned because of the descending and limit options.
+views. If no arguments are passed, all rows of the result are returned.
+However, there are several arguments that can be used to filter the results.
+For example, the following query uses the "descending" and "limit" arguments.
+Note the use of single quotes around the URL, necessary because "&" is a shell
+operator. In this case the "head" command was not used. The total_rows property
+still counts 926, but only the last three rows are returned because of the
+filtering options.
 
 <pre>
 $ curl -s 'http://ramalho.couchone.com/lilcouch/_design/lilacs/_view/center?descending=true&limit=3'
 </pre>
 
 Another possibility is to filter by key. The next query returns only the
-documents created by the cooperating center with code "CO113"
+documents created by the cooperating center with code "CO113":
 
 <pre>
 $ curl -s 'http://ramalho.couchone.com/lilcouch/_design/lilacs/_view/center?key="CO113"'
 </pre>
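
When such URLs are built by a program instead of typed in a shell, the main detail to remember is that key values are JSON, so string keys keep their double quotes. The helper below is a sketch under that assumption; viewUrl is a name invented for this example and is not part of CouchDB or of any library.

```javascript
// Builds a view query URL. The key, startkey and endkey arguments are
// JSON-encoded (a string key becomes "CO113" with quotes) and then
// percent-encoded; other arguments are passed through as plain text.
function viewUrl(base, params) {
  const jsonParams = ["key", "startkey", "endkey"];
  const parts = Object.keys(params).map(function (name) {
    const raw = jsonParams.includes(name)
      ? JSON.stringify(params[name])
      : String(params[name]);
    return encodeURIComponent(name) + "=" + encodeURIComponent(raw);
  });
  return parts.length ? base + "?" + parts.join("&") : base;
}

const base = "http://ramalho.couchone.com/lilcouch/_design/lilacs/_view/center";
console.log(viewUrl(base, { descending: true, limit: 3 }));
console.log(viewUrl(base, { key: "CO113" }));
```

The percent-encoded quotes (%22) are equivalent to the literal quotes used in the curl examples.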
 
 It is also possible to limit by starting and ending keys. For example, this
-query returns the documents created by centers starting with CO prefix (for
+query returns the documents created by centers starting with the CO prefix (for
 Colombia), up to and including the CO149 center. Note that the offset property
-tells us that 744 rows were skipped to get to the first with key="CO":
+tells us that 744 rows were skipped to get to the first one with key="CO":
 
 <pre>
 $ curl -s 'http://ramalho.couchone.com/lilcouch/_design/lilacs/_view/center?startkey="CO"&endkey="CO149"'
     <dd>returns first occurrence of tag in record, otherwise returns missing
     value;</dd>
 <dt>getall(record, tag)</dt>
-    <dd>returns all occurrences of tag in record or an empty list of the tag is
+    <dd>returns all occurrences of tag in record or an empty list if the tag is
     not found;</dd>
 <dt>getsub(record, tag, key, missing)</dt>
     <dd>returns subfield identified by key in the first occurrence of tag in
     record, otherwise returns missing value;</dd>
 <dt>getallsub(record, tag, key, missing)</dt>
-    <dd>returns an array with contents of subfield identfied by key in all
+    <dd>returns an array with contents of subfield identified by key in all
     occurrences of tag in record;</dd>
 </dl>
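
The following is a minimal sketch of how three of the functions listed above could be written for ISIS-JSON type 2 records. It is not the library's actual code: the subfield helper is hypothetical, and the handling of occurrences that lack the requested subfield in getallsub is our assumption.

```javascript
// Assumes record[tag] is an array of occurrences, each occurrence being an
// array of [subfieldKey, value] pairs (ISIS-JSON type 2).
function getall(record, tag) {
  return record[tag] || [];
}

// Hypothetical helper: first value stored under key in one occurrence.
function subfield(occurrence, key) {
  for (const pair of occurrence) {
    if (pair[0] === key) return pair[1];
  }
  return undefined;
}

function getsub(record, tag, key, missing) {
  const occs = getall(record, tag);
  if (occs.length === 0) return missing;
  const found = subfield(occs[0], key);
  return found === undefined ? missing : found;
}

function getallsub(record, tag, key, missing) {
  // Assumption: an occurrence without the subfield contributes `missing`.
  return getall(record, tag).map(function (occ) {
    const found = subfield(occ, key);
    return found === undefined ? missing : found;
  });
}

// Field #4 as it appears in the sample record shown earlier.
const record = { "4": [[["_", "LILACS"]], [["_", "LLXPEDT"]]] };
console.log(getsub(record, "4", "_", null));
console.log(JSON.stringify(getallsub(record, "4", "_", null)));
console.log(JSON.stringify(getall(record, "99")));
```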
 
 
 [XXX: screenshot qunit.png]
 
-Now back to the map function, note that the call to emit uses the subfield
+Now, back to the map function, note that the call to emit uses the subfield
 occurrence as key and the value is just a number 1. This is because the intent
 of this view is to produce an aggregate count of each different key. To achieve
 this, we need a reduce function to sum the values emitted, like this:
 }
 [/sourcecode]</pre>
 
-Now that our au_countries view has both a map and a view function, we can query
+Now that our au_countries view has both a map and a reduce function, we can query
 it:
 
 <pre>$ curl -s 'http://ramalho.couchone.com/lilcouch/_design/lilacs/_view/au_countries'
 ]}
 </pre>
 
-The result set above, limited to 10 rows, shows number of occurrences of each
-key. Obviously there are four entries for Brazil, with alternate and also wrong
-spellings. This would have to be dealt with elsewhere. The key point here is
-that the reduce function and the group option work together to produce aggregate
-results, similar to the ones we can produce with the SQL GROUP BY clause in a
-relational DBMS.
+The result set above, limited to 10 rows, shows the number of occurrences of
+each key. Obviously there are four entries for Brazil, with alternate and
+incorrect spellings. This would have to be dealt with elsewhere. The key point
+here is that the reduce function and the group option work together to produce
+aggregate results, similar to the ones we can produce with the SQL GROUP BY
+clause in a relational DBMS.
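
The way map, reduce and the group option cooperate can be imitated in plain JavaScript. This is only an in-memory sketch: the sample documents and the "country" field are invented, emit is passed as a parameter instead of being a global as in CouchDB, and keys are ordered with a plain string sort, not CouchDB's collation rules.

```javascript
// Map step: one row per document, with the value 1 to be summed later.
function map(doc, emit) {
  emit(doc.country, 1);
}

// Same signature as a CouchDB reduce function; rereduce is unused here.
function reduce(keys, values, rereduce) {
  return values.reduce(function (a, b) { return a + b; }, 0);
}

// Imitates a grouped query: collect emitted rows, group them by key,
// then reduce each group to a single value.
function queryGrouped(docs) {
  const emitted = [];
  docs.forEach(function (doc) {
    map(doc, function (key, value) { emitted.push([key, value]); });
  });
  const groups = new Map();
  emitted.forEach(function (kv) {
    if (!groups.has(kv[0])) groups.set(kv[0], []);
    groups.get(kv[0]).push(kv[1]);
  });
  return Array.from(groups.keys()).sort().map(function (key) {
    return { key: key, value: reduce(null, groups.get(key), false) };
  });
}

const result = queryGrouped([
  { country: "Brazil" }, { country: "Brasil" }, { country: "Brazil" }
]);
console.log(JSON.stringify(result));
```

As in the real result set, variant spellings remain separate keys, each with its own count.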
 
 <h2>Results and Conclusion</h2>
 
 <h3>Results</h3>
 
 We have identified in CouchDB and MongoDB two modern, Open Source database
-systems which are suitable to handle semistructured records like those defined
-by the ISO-2709 standard and the ISIS family of systems, therefore serving the
-needs of MARC and LILACS datasets.
+systems which are suitable for semistructured records like those defined by the
+ISO-2709 standard and the ISIS family of systems, serving the needs of MARC and
+LILACS datasets.
 
 Furthermore, we created a tool to convert ISIS records from the ISO-2709 format
-to JSON documents suitable for loading into CouchDB (or MongoDB, although we
-have not shown that in this paper).
+to JSON documents suitable for loading into CouchDB (or MongoDB, though we have
+not shown that in this paper).
 
 We considered a number of alternative representations for ISIS data in JSON
-format, and in this paper we used the type 2 representation which, although
+format. In this paper we used the type 2 representation which, though
 somewhat awkward to work with, preserves subfield ordering and allows for
 repeated subfields, a feature of MARC records. ISIS fields do not have
 indicators, so we have not discussed how to represent them in JSON. One
 
 <h3>Conclusion</h3>
 
-These experiments and developments have shown that it is easy to convert
-ISO-2709 data to a document database like CouchDB or MongoDB. After doing so,
-it becomes almost trivial to create Web services to publish the data,
-particularly in CouchDB, thanks to its native support to JSON over HTTP.
+These experiments and developments have shown that it is easy to import
+ISO-2709 data into a document database like CouchDB or MongoDB.  After doing
+so, it becomes almost trivial to create Web services to publish the data,
+particularly in CouchDB, thanks to its native support for JSON over HTTP.
 
 While the semistructured data model was only formalized in the mid-1990s, the
 ISO-2709 and ISIS record formats have always been concrete, albeit limited,
-examples of it. Research into that model includes results such as algorihmts to
+examples of it. Research into that model includes results such as algorithms to
 extract a formal schema from actual datasets, methods for dealing with shared or
 duplicate data, and a normal form adapted to semistructured schemas (Tok, 2005).
-We have much to learn and apply from the semistructured data research into our
+We have much to learn and apply from semistructured data research to our
 daily work with bibliographic records.
 
-At BIREME/PAHO/WHO we continue investigating the challenges and opportunities of
-converting from the ISIS legacy to modern Open Source document databases.
-Meanwhile, we are also developing new applications, not limited to the ISIS data
-model and legacy data, using Python, the Pyramid framework, JavaScript and
-CouchDB. These new developments allow us to think about how we want the LILACS
-bibliographic database to look like in the year 2015, when we will celebrate its
-30th aniversary.
+At BIREME/PAHO/WHO we continue investigating the challenges and opportunities
+of converting from the ISIS legacy systems to modern Open Source document
+databases. Meanwhile, we are developing new applications, not limited to
+the ISIS data model and legacy data, using Python, the Pyramid framework,
+JavaScript and CouchDB. These new developments allow us to think about how we
+want the LILACS bibliographic database to look in the year 2015, when we
+will celebrate its 30th anniversary.
 
 
 <h2>References</h2>
 child elements.” Mixed content also exists in HTML and in SGML, their common
 ancestor.
 
-[60] Alan Hopkinson “CDS/ISIS: the second decade 
+[60] Alan Hopkinson “CDS/ISIS: the second decade”
 http://idv.sagepub.com/cgi/content/abstract/21/1/31
 
 [70] http://bsonspec.org/#/specification
 
 [130] http://journal.code4lib.org/articles/3832
 
-[140] Association lists are not to be confused with an associative arrays, such
+[140] Association lists are not to be confused with associative arrays, such
 as those in PHP. An associative array is like a hash in Perl and Ruby, a Python
 dictionary or a JSON/JavaScript object. An association list, or alist, however,
 is not a primitive type in those languages, but can be built as an array of
 any language. An Erlang view server is bundled with recent versions of CouchDB,
 but is not enabled by default. Python can be easily configured for that purpose,
 and Java is also known to be used. A list of view server implementations can be
-found at http://wiki.apache.org/couchdb/View_server
+found at http://wiki.apache.org/couchdb/View_server.