Commits

Gabriel Farrell  committed 29f0b4c

More copy edits

  • Parent commits b3c7b6f

Files changed (1)

File code4lib13/isis2couchdb.txt

 
 Except for the repetition of the "email" key, it's very similar to JSON, or
 JavaScript Object Notation (Crockford 2006). The presence of keys describing
-fields like "name", "first," etc., and the storage of those keys alongside the
+fields like "name," "first," etc., and the storage of those keys alongside the
 data, is what the encyclopedia entry means by coexisting data values and schema
 components (Liu, 2009). It is a characteristic shared by ISO-2709, XML, and
 JSON (with the limitation that ISO-2709 fields are identified by tags composed
 as deep as graph databases (designed to allow general queries over paths of
 nested objects). What they offer is somewhere in between: JSON-like records
 allowing nested structures, and expressive query languages to index and retrieve
-those records. Those rich records are called “documents”. CouchDB and MongoDB
-call themselves “document databases”.
+those records. Those rich records are called “documents.” CouchDB and MongoDB
+call themselves “document databases.”
 
 In the case of CouchDB, the document format is JSON. MongoDB uses BSON, a binary
 format inspired by JSON but offering more data types, such as int32, datetime,
 appeared before all occurrences of field #10, while in the JSON representation
 the key “10” precedes key “2”. This is irrelevant in practice. But, crucially,
 the order of the authors is the same in both formats: first “Kanda, Paulo
-Afonso”, then “Smidth, Magali Taino”.
+Afonso,” then “Smidth, Magali Taino.”
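
The ordering behaviour described above can be illustrated with a minimal Python sketch. It is not taken from isis2json.py; the function name `group_fields` and the value used for field #2 are hypothetical, chosen only for illustration. It groups repeated tags into lists, so the order of occurrences within a tag is kept even though the order of the keys themselves changes:

```python
import json
from collections import defaultdict


def group_fields(pairs):
    """Group (tag, value) pairs by tag, keeping occurrence order within each tag."""
    record = defaultdict(list)
    for tag, value in pairs:
        record[tag].append(value)
    return dict(record)


# Field #2 comes first in this made-up input; its value is a placeholder.
fields = [
    ("2", "hypothetical-id"),
    ("10", "Kanda, Paulo Afonso"),
    ("10", "Smidth, Magali Taino"),
]

record = group_fields(fields)

# With sorted string keys, "10" precedes "2", but the authors keep their order.
print(json.dumps(record, sort_keys=True, ensure_ascii=False))
```

Sorting the keys as strings is only one way the key order can end up differing from the field order in the source record; the point is that the list under key "10" is unaffected.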
 
 Regarding subfield ordering within a field, the LILACS data dictionary does
 establish a canonical ordering. For example, in the case of field #10 the order
 To convert ISIS records into JSON structures, we developed a Python script
 called isis2json.py. It can be executed with both the Python and Jython
 interpreters, versions 2.5 through 2.7. When running under Python it can read
-only ISO-2709 files, but as a Jython script it leverages the ZeusIII Java library
-developed by Heitor Barbieri at BIREME/OPAS/OMS and can also read binary ISIS
-files in .MST format directly. Several options control the structure of the JSON
-output. For example, the command line below generates output suitable for batch
-importing to CouchDB:
+only ISO-2709 files, but as a Jython script it leverages the ZeusIII Java
+library developed by Heitor Barbieri at BIREME/OPAS/OMS and can also read
+binary ISIS files in .MST format directly. Several options control the
+structure of the JSON output. For example, the command line below generates
+output suitable for batch importing to CouchDB:
 
 <pre>
-  $ ./isis2json.py cds.iso -c -f -q 100 > cds1.json
+$ ./isis2json.py cds.iso -c -f -q 100 > cds1.json
 </pre>
 
 The arguments used in the above example are:
 is done with a PUT request:
 
 <pre>
- $ curl -X PUT http://admin_user:password@127.0.0.1:5984/lilacs
+$ curl -X PUT http://admin_user:password@127.0.0.1:5984/lilacs
 </pre>
 
 One operation that can only be done using cURL or some custom-built HTTP
 convert an ISO-2709 file to JSON, and then upload it to CouchDB in two steps:
 
 <pre>
-  $ ./isis2json.py cds.iso -c > cds1.json
-  $ curl -d @cds1.json -H"Content-Type: application/json" \
-         -X POST http://127.0.0.1:5984/cds/_bulk_docs
+$ ./isis2json.py cds.iso -c > cds1.json
+$ curl -d @cds1.json -H"Content-Type: application/json" \
+       -X POST http://127.0.0.1:5984/cds/_bulk_docs
 </pre>
 
 When used together, the -q and the -s (skip) options allow splitting the output
 64MB, took 147 seconds using this shell command:
 
 <pre>
-  $ isis2json.py -c -p v -i 2 -q 20000 lilacs100k.iso | \
-    curl -d @- -H"Content-Type: application/json" \
-         -X POST http://127.0.0.1:5984/lilacs/_bulk_docs
+$ isis2json.py -c -p v -i 2 -q 20000 lilacs100k.iso | \
+  curl -d @- -H"Content-Type: application/json" \
+       -X POST http://127.0.0.1:5984/lilacs/_bulk_docs
 </pre>
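
For readers who prefer to script the upload, the same kind of batched request can be sketched in Python using only the standard library. This is an assumption-laden illustration, not part of isis2json.py: the helper names `batches` and `bulk_request` and the sample documents are hypothetical. The only CouchDB detail relied on is that `_bulk_docs` accepts a POST whose JSON body has the form `{"docs": [...]}`.

```python
import json
import urllib.request


def batches(docs, size):
    """Yield consecutive slices of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def bulk_request(db_url, docs):
    """Build (but do not send) a POST request for the _bulk_docs endpoint."""
    body = json.dumps({"docs": docs}).encode("utf-8")
    return urllib.request.Request(
        db_url.rstrip("/") + "/_bulk_docs",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical documents standing in for converted ISIS records.
docs = [{"_id": str(n)} for n in range(5)]

requests_to_send = [
    bulk_request("http://127.0.0.1:5984/lilacs", batch)
    for batch in batches(docs, 2)
]

# Sending is left out so the sketch runs without a server; with CouchDB
# running, each request could be passed to urllib.request.urlopen().
print(len(requests_to_send), "requests prepared")
```

Building the requests separately from sending them makes it easy to experiment with different batch sizes before touching the database.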
 
 Loading the same data in two 10,000-record batches took 100 seconds when taking