

ThingFish is an extensible digital asset manager written in Ruby. It can be used to store chunks of data on the network in an application-independent way, associate the chunks with other chunks through metadata, and then search for the chunk you need later and fetch it again, all through a REST API over HTTP.


Our most recent release is 0.3.0.



The system should, in its most basic form, only do two things:

  1. Store files via a network interface.
  2. Store metadata about the files and provide a search facility for finding files via their associated metadata.


The system should have as much of the backend details abstracted out into plugin functionality as possible. This will allow the basic system to remain simple and be expanded to fit an environment's needs. It also makes incremental functionality easier, as plugins can be created when the functionality they encapsulate is required rather than up front.

We wish to minimize the dependencies necessary to get a basic installation up and running. The base system should only require a recent installation of Ruby for minimal functionality. Plugins which extend it or replace the default simple backends with better-tuned and functional ones may depend on whatever they wish.

Language Neutrality

The service API presented by ThingFish should be as portable as possible, requiring only network sockets and a standards-compliant implementation of HTTP 1.1.

To this end, we've chosen the REST architectural style.


Scalability

While scalability is an obvious goal for almost every network-accessible service, we feel it's important to consider it up front.

Because of its modularity, ThingFish should be able to scale both deep and wide without sacrificing simplicity in the default configuration. New strategies for scalability (caching, file storage, metadata semantics) can be introduced as they are needed without having to take their implementation into consideration for the initial system.

Using a REST API also helps with wide scalability: because the interaction is stateless, requests can be load-balanced across servers with little to no change to the server software.


Default Handler

Fetch the toplevel index (exactly what this means is subject to content negotiation)
GET /
Return the data for a given file
GET /«uuid»
Upload a file
POST /
Replace a file's data
PUT /«uuid»
Delete a file from the datastore
DELETE /«uuid»
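For illustration, here's how a client might construct requests for these routes using Ruby's standard Net::HTTP library. The hostname, port, and UUID are hypothetical, not defaults the server actually ships with:

```ruby
require 'net/http'
require 'uri'

# Hypothetical ThingFish instance and resource UUID
base = URI('http://thingfish.example.com:3474/')
uuid = 'c10b7ee8-cdad-11db-a110-23336f446aba'

index   = Net::HTTP::Get.new(base)                     # fetch the toplevel index
fetch   = Net::HTTP::Get.new(URI.join(base, uuid))     # fetch a file's data
replace = Net::HTTP::Put.new(URI.join(base, uuid))     # replace a file's data
delete  = Net::HTTP::Delete.new(URI.join(base, uuid))  # remove a file
```

A real client would then send each request via `Net::HTTP.start`; only the request construction is shown here.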

Search Handler

Returns a list of URIs for files which match the given search criteria.

Find files with a given filename
GET /search?filename=ovenmitt.jpg
Find files with given tags
GET /search?tag=(pain|firing%20squad)+ovenmitt
Find a list of files with a complex query
GET /search?tag=nsfw;filename=logo*;created=before+1/12/2007;owner=mahlon
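Simple-search URIs like the ones above are just query strings, so a client can build them with the standard library. A minimal sketch (the hostname is hypothetical):

```ruby
require 'uri'

# Compose a simple-search URI for the search handler from a criteria hash.
def search_uri(criteria)
  URI::HTTP.build(
    host:  'thingfish.example.com',   # hypothetical server
    path:  '/search',
    query: URI.encode_www_form(criteria)
  )
end

uri = search_uri(filename: 'ovenmitt.jpg', owner: 'mahlon')
```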
Complex query interface
Find a list of still images created by the same person in the same namespace as a given resource via a metastore implementation-specific query (RDF+SPARQL in this example):
POST /search HTTP/1.1
Content-type: application/sparql-query

PREFIX rdf: <>
PREFIX dc: <>
PREFIX dcmi: <>
PREFIX thingfish: <>
SELECT ?urn
WHERE {
    <urn:uuid:c10b7ee8-cdad-11db-a110-23336f446aba> dc:creator ?person .
    <urn:uuid:c10b7ee8-cdad-11db-a110-23336f446aba> thingfish:namespace ?ns .
    ?urn dc:creator ?person .
    ?urn dc:type dcmi:StillImage .
    ?urn thingfish:namespace ?ns .
}

Metadata Handler

Returns a list of metadata tuples.

Return a list of all metadata tuples for a given file
GET /metadata/«uuid»
Find all tags in the store
GET /metadata/tag
Find all tags for a given file
GET /metadata/«uuid»/tag
Return the first preview that matches the request's Accept header for a given file
GET /metadata/«uuid»/preview
Add a tag for the given file
POST /metadata/«uuid»/tag
Replace the namespace for the given file
PUT /metadata/«uuid»/namespace
Delete a namespace for the given file
DELETE /metadata/«uuid»/namespace

Note that DELETE /metadata/«uuid» and POST /metadata/«uuid» are not supported, since wholesale removal or replacement of a file's metadata would corrupt the record of the underlying data. These requests should return a BAD_REQUEST.
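As a sketch, that guard might look like this in Ruby; the constants and method name are illustrative, not ThingFish's actual handler API:

```ruby
BAD_REQUEST = 400
OK          = 200

# Reject DELETE and POST against a whole-resource metadata URI, since those
# would wipe or replace the record of the underlying data wholesale.
# Sub-resource URIs like /metadata/«uuid»/tag remain writable.
def metadata_response(http_method, path)
  if path =~ %r{\A/metadata/[0-9a-f-]+\z} && %w[DELETE POST].include?(http_method)
    BAD_REQUEST
  else
    OK
  end
end
```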

Admin Handler

Show available diskspace
GET /admin/diskspace
Show and edit space/quotas per user
GET /admin/quotas
Cleanup and maintenance (candidates for deletion?)
GET /admin/cleanup
Show current usage (status, stats, and graphs -- rrd tool, IO, trending performance)
GET /admin/status

Additional Features

Auto-Generation of Metadata

ThingFish will also support extraction and auto-generation of metadata from the stored file.


  • Detection of filetype based on magic for less-useful upload mimetypes
    • application/octet-stream
    • text/plain
  • Previews for appropriate mimetypes
  • Extraction of embedded metadata (e.g., camera info, codec, etc.)
  • Pluggable extractions
    • Each extractor knows what mimetypes it can extract its metadata from
  • Upload time
  • Uploading agent (e.g., User-Agent header)
  • Uploading IP

We're trying to name metadata according to the conventions of the Dublin Core where possible/appropriate. The default metadata we're currently extracting (from the HTTP request) is:

| Description   | Dublin Core Type | Metastore Attribute |
|---------------|------------------|---------------------|
| Uploading IP  | n/a              | uploadaddress       |
| Upload Date   | created          | created             |
| Modified Date | modified         | modified            |
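A sketch of extracting those request-derived attributes, using a plain hash as a stand-in for ThingFish's real request object:

```ruby
require 'time'

# Map request details onto the default metastore attributes. The request
# here is a hash stand-in; the real request class differs.
def default_metadata(request)
  {
    'uploadaddress' => request[:remote_addr],
    'created'       => request[:received_at].iso8601,
    'modified'      => request[:received_at].iso8601,
  }
end

meta = default_metadata(remote_addr: '10.0.0.5', received_at: Time.utc(2007, 1, 12))
```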

Content Negotiation

The daemon will also support pluggable transparent HTTP content negotiation, which will allow customizable serialization of complex datatypes and on-the-fly transformation of fetched files.

For URIs that return RDF triples or other structural data, the client will be able to fetch it in YAML, JSON, XML, HTML, or perhaps other formats (Turtle?, N3?)

This will be implemented with a table of transformations from one mimetype to another. If the file's mimetype is already in the request's accepts list, it is returned as-is. Otherwise, the response format is determined by iterating over the list of formats the requester accepts and looking each one up in the table of transformations. If a transformation to the requested type exists, it is executed and the data is returned in the requested format.
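The lookup described above can be sketched as follows; the table contents and symbol names are illustrative, not the daemon's actual filter registry:

```ruby
# Transformation table: [source type, target type] => converter.
TRANSFORMS = {
  ['application/x-yaml', 'application/json'] => :yaml_to_json,
  ['application/x-yaml', 'text/html']       => :yaml_to_html,
}

# Return the resource as-is if its type is acceptable; otherwise walk the
# client's accepted types looking for a registered transformation.
def negotiate(resource_type, accepted_types)
  return :as_is if accepted_types.include?(resource_type)
  accepted_types.each do |wanted|
    transform = TRANSFORMS[[resource_type, wanted]]
    return transform if transform
  end
  nil  # no transformation: respond with 406 Not Acceptable
end
```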

Pluggable output formats:

  • YAML
  • JSON
  • XML
  • HTML
  • RSS
  • Image/audio re-encoding PNG->TIFF, etc.
  • PDF
  • Human-readable text

Filetype conversion 'caching'

Depending on the filetypes involved, repeatedly performing conversions that are thrown away and recalculated on each request could tax the server. The filter interface (parent class) should therefore provide an API for storing conversion results as new ThingFish resources (with a special 'variant' metadata key) and for updating the original resource's metadata with a reference to the new type and UUID. Each filter that performs a conversion could then check the original UUID for a reference to a pre-calculated version first.
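A minimal sketch of that check-before-convert flow, using a plain hash in place of a real metastore and a placeholder in place of a real conversion:

```ruby
# Return the UUID of a pre-computed variant of the target type if one is
# recorded in the original's metadata; otherwise "convert" and record it.
def cached_or_convert(metastore, uuid, target_type)
  variants = metastore.fetch(uuid, {}).fetch('variants', {})
  if (variant_uuid = variants[target_type])
    variant_uuid                       # reuse the stored conversion
  else
    new_uuid = "variant-of-#{uuid}"    # placeholder for a real conversion + store
    metastore[uuid] ||= {}
    (metastore[uuid]['variants'] ||= {})[target_type] = new_uuid
    new_uuid
  end
end
```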

Later, the /admin interface could have a "variants" section (or whatever) that would display total diskspace in use by variants, and optionally purge them all wholesale. Neat.


File Storage

The storage backend should be pluggable. The default implementation should use a simple filesystem directory structure, perhaps with hashed names based on a resource's UUID.
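A minimal sketch of such a hashed layout; the two-level fan-out and root directory are assumptions for illustration, not the actual default filestore's scheme:

```ruby
require 'pathname'

# Fan files out into subdirectories keyed on the first characters of the
# resource UUID, to avoid huge flat directories.
def hashed_path(root, uuid)
  Pathname(root) + uuid[0, 2] + uuid[2, 2] + uuid
end

path = hashed_path('/var/thingfish', 'c10b7ee8-cdad-11db-a110-23336f446aba')
```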

Duplicates should be avoided via checksumming, with the error response returning a referral to the original, perhaps via a Location header.

We also need to handle the case of duplicates being uploaded in the case where there are ACLs which restrict access to the original copy. In the case where there's already a resource in the filestore with the same checksum that is not accessible to the uploading user, a duplicate *should* be transparently created. We want to avoid informing the second uploading user of the first object's existence if she doesn't have permissions to view it to avoid information leakage.
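The ACL-aware dedup rule above can be sketched like this; the store and ACL structures are illustrative stand-ins:

```ruby
require 'digest/sha1'

# Reuse an existing resource only when the uploader can see it; otherwise
# store a duplicate so the upload reveals nothing about the original.
def store_upload(filestore, data, owner, acl)
  checksum = Digest::SHA1.hexdigest(data)
  existing = filestore.find { |r| r[:checksum] == checksum }
  if existing && acl.call(owner, existing)
    existing            # visible duplicate: refer to the original
  else
    record = { checksum: checksum, owner: owner, data: data }
    filestore << record # invisible to this user, or genuinely new: store it
    record
  end
end
```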

Associated Metadata

The default metadata structure for ThingFish files will be a basic key/value list implemented with an in-memory hash. You can customize the metadata layer via pluggable metastore strategies.

We're writing a metadata plugin for the LAIKA ThingFish installation that adds a semantic layer implemented using RDF via the Redleaf RDF library. We'll start out with the Dublin Core Metadata Terms at a minimum, and then add other RDF vocabularies. Some likely additions:

FOAF (Friend of a Friend)
designed to describe people, their interests and interconnections.
DOAC (Description of a Career)
supplements FOAF to allow the sharing of résumé information.
DOAP (Description of a Project)
designed to describe software projects; uses FOAF to identify the people involved
Images Ontology
Ontology for Images, image regions (SVG), videos, frames, segments, and what they depict.
Photography Vocabulary
Definitions of various terms related to photographs and photography equipment.

For more on the Semantic Store, see the Semantic Metastore page.


Metadata searching will support two interfaces: a basic query mapper that will generate simple queries via a naive interface on the current metastore strategy, and a more-robust query engine that will provide a raw, implementation-specific query interface via a POST request, with the body of the request containing the query text.

Results from a search will be returned via one of the results-serialization strategies.

"Advanced" Search Handler


Developer Notes
We're keeping notes as we develop about various things so we don't forget them, along with a list of ideas for smallish projects with ThingFish backends that will help drive the implementation.
Language Examples
A few client examples of using ThingFish server introspection in various languages.