crowdhwr is a collaborative manuscript transcription system.
Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic conversion of scanned images of handwritten, typewritten or printed text into machine-encoded text.
Handwriting recognition (or HWR) is the ability of a computer to receive and interpret intelligible handwritten input from sources such as paper documents, photographs, touch-screens and other devices. The image of the written text may be sensed "off line" from a piece of paper by optical scanning (optical character recognition) or intelligent word recognition.
Off-line handwriting recognition
Off-line handwriting recognition involves the automatic conversion of text in an image into letter codes which are usable within computer and text-processing applications. The data obtained in this form is regarded as a static representation of handwriting. Off-line handwriting recognition is comparatively difficult, as different people have different handwriting styles. At present, OCR engines focus primarily on machine-printed text, and ICR (intelligent character recognition) engines on hand-"printed" text (written in capital letters). No OCR/ICR engine currently supports general handwriting recognition.
Scripto opens up the possibilities of community transcription for digital humanities projects in universities, libraries, archives, and museums. With easy-to-implement extensions for popular open source content management systems, including Omeka, WordPress, and Drupal, administrators of any project with collection materials requiring transcription can now enlist a community of enthusiasts to participate in this aspect of cultural heritage work.
Scripto is an open-source tool that permits registered users to view digital files and transcribe them with an easy-to-use toolbar, rendering that text searchable. The tool includes a versioning history and editorial controls to make public contributions more manageable, and supports the transcription of a wide range of file types (both images and documents).
Developed by the Roy Rosenzweig Center for History and New Media, Scripto was funded by the National Endowment for the Humanities and the National Archives and Records Administration’s National Historical Publications and Records Commission.
Scripto is implemented in PHP, using Zend Framework and MediaWiki, as a library that can be used in other systems such as Drupal, WordPress, and Omeka.
DIY History lets you do it yourself to help make historic artifacts easier to use. Our digital library holds hundreds of thousands of items -- much more than library staff could ever catalog alone, so we're appealing to the public to help out by attaching text in the form of transcriptions, tags, and comments. Through "crowdsourcing," or engaging volunteers to contribute effort toward large-scale goals, these mass quantities of digitized artifacts become searchable, allowing researchers to quickly seek out specific information, and general users to browse and enjoy the materials more easily. Please join us in preserving our past by keeping the historic record accessible -- one page or picture at a time.
The Proofread Page extension can render a book either as a column of OCR text beside a column of scanned images, or broken into its logical organization (such as chapters or poems) using transclusion.
The extension is intended to allow easy comparison of text to the original and allow rendering of a text in several ways without duplicating data. Since the pages are not in the main namespace, they are not included in the statistical count of text units.
The extension is installed on all Wikisource wikis.
Example page with transcript:
FromThePage is free software that allows volunteers to transcribe handwritten documents on-line. It's easy to index and annotate subjects within a text using a simple, wiki-like mark-up. Users can discuss difficult writing or obscure words within a page to refine their transcription. The resulting text is hosted on the web, making documents easy to read and search.
- Wiki-style Editing: Users add or edit transcriptions using simple, wiki-style syntax on one side of the screen while viewing a scanned image of the manuscript page on the other side.
- Version Control: Changes to each page transcription are recorded and may be viewed to follow the edit history of a page.
- Wikilinks: Subjects mentioned within the document are indexed via simple wikilinks within the transcription. Users can annotate subjects with full subject articles.
- Presentation: Readers can view transcriptions in a multi-page format or alongside page images. They can also read all the pages that mention a subject.
- Automatic Markup: FromThePage can suggest wikilinks to editors by mining previously edited transcriptions. This helps ensure editorial consistency and vastly reduces the amount of effort involved in markup.
- Internet Archive integration: FromThePage can be pointed at manuscripts hosted on Archive.org. It will import the page structure and any printed page titles into its native format for transcription, while serving page images from the Internet Archive.
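
The automatic markup idea in the list above can be illustrated with a small sketch. This is not FromThePage's actual code (which is written in Ruby on Rails); it is a minimal Python illustration that assumes a wikilink syntax of `[[Subject]]` or `[[Subject|text as written]]`. The function names `mine_links` and `suggest_links` are hypothetical.

```python
import re

# Assumed wikilink syntax: [[Subject]] or [[Subject|text as written]].
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def mine_links(transcriptions):
    """Map each linked phrase (lowercased) to the subject it was linked to."""
    known = {}
    for text in transcriptions:
        for subject, display in WIKILINK.findall(text):
            known[(display or subject).strip().lower()] = subject.strip()
    return known


def suggest_links(text, known):
    """Wrap unlinked occurrences of previously linked phrases in wikilinks."""
    if not known:
        return text
    # Try longer phrases first so a full name wins over a shorter overlap.
    phrases = sorted(known, key=len, reverse=True)
    alternatives = "|".join(re.escape(p) for p in phrases)
    pattern = re.compile(
        r"\[\[[^\[\]]*\]\]|(?<!\w)(" + alternatives + r")(?!\w)",
        re.IGNORECASE,
    )

    def replace(match):
        phrase = match.group(1)
        if phrase is None:
            # The match is an existing wikilink: leave it untouched.
            return match.group(0)
        subject = known[phrase.lower()]
        if phrase == subject:
            return f"[[{subject}]]"
        return f"[[{subject}|{phrase}]]"

    return pattern.sub(replace, text)


if __name__ == "__main__":
    edited = ["Went to see [[John Smith|Mr. Smith]] at [[Richmond]]."]
    known = mine_links(edited)
    print(suggest_links("Mr. Smith returned from Richmond.", known))
```

In a real system the suggestions would presumably be shown to the editor for confirmation rather than applied automatically, since the same phrase can refer to different subjects on different pages.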
FromThePage is implemented in Ruby on Rails and uses MySQL as its database.
An HTML5 SVG editor that can be used to select words in handwritten text.
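
As a rough sketch of what such word selections might look like as data, the snippet below builds an SVG overlay with one `<rect>` per selected word on top of a page image. Everything here is an assumption for illustration: the `WordBox` record, the `build_overlay` function, the pixel coordinate system, and the use of the plain `href` attribute (as in SVG 2) are hypothetical and not taken from any existing editor.

```python
from dataclasses import dataclass
from xml.etree import ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


@dataclass
class WordBox:
    """Hypothetical record for one selected word, in image pixel coordinates."""
    word_id: str
    x: int
    y: int
    width: int
    height: int


def build_overlay(image_href, page_width, page_height, boxes):
    """Return an SVG string: the page image plus one outlined rect per word."""
    svg = ET.Element("svg", {
        "xmlns": SVG_NS,
        "viewBox": f"0 0 {page_width} {page_height}",
    })
    # Background layer: the scanned manuscript page.
    ET.SubElement(svg, "image", {
        "href": image_href,
        "width": str(page_width),
        "height": str(page_height),
    })
    # One rectangle per selected word, identified by its id attribute.
    for box in boxes:
        ET.SubElement(svg, "rect", {
            "id": box.word_id,
            "x": str(box.x),
            "y": str(box.y),
            "width": str(box.width),
            "height": str(box.height),
            "fill": "none",
            "stroke": "red",
        })
    return ET.tostring(svg, encoding="unicode")


if __name__ == "__main__":
    boxes = [WordBox("w1", 10, 20, 100, 30), WordBox("w2", 120, 20, 80, 30)]
    print(build_overlay("page1.jpg", 800, 1200, boxes))
```

Keeping each word as a separately identified rectangle means a transcription of that word can later be attached to the same id.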