Source

stringsearch / Data / ByteString / Search / BoyerMoore.hs

-- |
-- Module         : Data.ByteString.Search.BoyerMoore
-- Copyright      : Daniel Fischer
--                  Chris Kuklewicz
-- Licence        : BSD3
-- Maintainer     : Daniel Fischer <daniel.is.fischer@googlemail.com>
-- Stability      : Provisional
-- Portability    : non-portable (BangPatterns)
--
-- Fast overlapping Boyer-Moore search of both strict and lazy
-- 'ByteString' values.
--
-- Descriptions of the algorithm can be found at
-- <http://www-igm.univ-mlv.fr/~lecroq/string/node14.html#SECTION00140>
-- and
-- <http://en.wikipedia.org/wiki/Boyer-Moore_string_search_algorithm>
--
-- Original authors: Daniel Fischer (daniel.is.fischer at googlemail.com) and
-- Chris Kuklewicz (haskell at list.mightyreason.com).
module Data.ByteString.Search.BoyerMoore
    {-# DEPRECATED "Use the new interface instead" #-} (
                                         -- * Overview
                                         -- $overview

                                         -- ** Changes
                                         -- $changes

                                         -- ** Deprecation
                                         -- $deprecation

                                         -- ** Parameter and return types
                                         -- $types

                                         -- ** Lazy ByteStrings
                                         -- $lazy

                                         -- ** Performance
                                         -- $performance

                                         -- ** Complexity
                                         -- $complexity

                                         -- ** Partial application
                                         -- $currying

                                         -- ** Integer overflow
                                         -- $overflow

                                         -- * Functions
                                           matchLL
                                         , matchLS
                                         , matchSL
                                         , matchSS
                                         ) where

import Data.ByteString.Search.Internal.BoyerMoore
            (matchLS, matchSS)
import Data.ByteString.Lazy.Search.Internal.BoyerMoore
            (matchLL, matchSL)

-- $overview
--
-- This module exists only for backwards compatibility. Nevertheless
-- there have been small changes in the behaviour of the functions.
-- The module exports four search functions: 'matchLL', 'matchLS',
-- 'matchSL', and 'matchSS'. All of them return the list of all
-- starting positions of possibly overlapping occurrences of a pattern
-- in a string.

-- $changes
--
-- Formerly, all four functions returned an empty list when passed
-- an empty pattern. Now, in accordance with the functions from the other
-- modules, @matchXY \"\" target = [0 .. 'length' target]@.

-- $deprecation
--
-- This module is /deprecated/. You should use the new interface provided
-- in "Data.ByteString.Search" resp. "Data.ByteString.Lazy.Search".

-- $types
--
-- The first parameter is always the pattern string.  The second
-- parameter is always the target string to be searched.  The returned
-- list contains the offsets of all /overlapping/ patterns.
--
-- A returned @Int@ or @Int64@ is an index into the target string
-- which is aligned to the head of the pattern string.  Strict targets
-- return @Int@ indices and lazy targets return @Int64@ indices.  All
-- returned lists are computed and returned in a lazy fashion.

-- $lazy
--
-- 'matchLL' and 'matchLS' take lazy bytestrings as patterns.  For
-- performance, if the pattern is not a single strict chunk then all
-- the the pattern chunks will copied into a concatenated strict
-- bytestring.  This limits the patterns to a length of (maxBound ::
-- Int).
--
-- 'matchLL' and 'matchSL' take lazy bytestrings as targets.
-- These are written so that while they work they will not retain a
-- reference to all the earlier parts of the the lazy bytestring.
-- This means the garbage collector would be able to keep only a small
-- amount of the target string and free the rest.

-- $currying
--
-- These functions can all be usefully partially applied.
-- Given only a pattern the partially applied version will compute
-- the supporting lookup tables only once, allowing for efficient re-use.
-- Similarly, the partially applied 'matchLL' and 'matchLS' will compute
-- the concatenated pattern only once.

-- $performance
--
-- In general, the Boyer-Moore algorithm is the most efficient method to
-- search for a pattern inside a string. The advantage over other algorithms
-- (e.g. Na&#239;ve, Knuth-Morris-Pratt, Horspool, Sunday) can be made
-- arbitrarily large for specially selected patterns and targets, but
-- usually, it's a factor of 2&#8211;3 versus Knuth-Morris-Pratt and of
-- 6&#8211;10 versus the na&#239;ve algorithm. The Horspool and Sunday
-- algorithms, which are simplified variants of the Boyer-Moore algorithm,
-- typically have performance between Boyer-Moore and Knuth-Morris-Pratt,
-- mostly closer to Boyer-Moore. The advantage of the Boyer-moore variants
-- over other algorithms generally becomes larger for longer patterns. For
-- very short patterns (or patterns with a very short period), other
-- algorithms, e.g. "Data.ByteString.Search.DFA" can be faster (my
-- tests suggest that \"very short\" means two, maybe three bytes).
--
-- In general, searching in a strict 'S.ByteString' is slightly faster
-- than searching in a lazy 'L.ByteString', but for long targets the
-- smaller memory footprint of lazy 'L.ByteStrings' can make searching
-- those (sometimes much) faster. On the other hand, there are cases
-- where searching in a strict target is much faster, even for long targets.
--
-- On 32-bit systems, 'Int'-arithmetic is much faster than 'Int64'-arithmetic,
-- so when there are many matches, that can make a significant difference.
--
-- Also, the modification to ameliorate the case of periodic patterns
-- is defeated by chunk-boundaries, so long patterns with a short period
-- and many matches exhibit poor behaviour (consider using @indices@ from
-- "Data.ByteString.Lazy.Search.DFA" or "Data.ByteString.Lazy.Search.KMP"
-- in those cases, the former for medium-length patterns, the latter for
-- long patterns; only 'matchLL' and 'matchSL' suffer from
-- this problem, though).

-- $complexity
--
-- Preprocessing the pattern string is O(@patternLength@).  The search
-- performance is O(@targetLength@\/@patternLength@) in the best case,
-- allowing it to go faster than a Knuth-Morris-Pratt algorithm.  With
-- a non-periodic pattern the worst case uses O(3\*@targetLength@)
-- comparisons.  The periodic pattern worst case is quadratic
-- O(@targetLength@\*@patternLength@) complexity for the original
-- Boyer-Moore algorithm.
--
-- The searching functions in this module contain a modification which
-- drastically improves the performance for periodic patterns.
-- I believe that for strict target strings, the worst case is now
-- /O/(@targetLength@) also for periodic patterns and for lazy target
-- strings, my semi-educated guess is
-- /O/(@targetLength@ * (1 + @patternLength@ \/ @chunkSize@)).

-- $overflow
--
-- The current code uses @Int@ to keep track of the locations in the
-- target string.  If the length of the pattern plus the length of any
-- strict chunk of the target string is greater or equal to
-- @'maxBound'::Int@ then this will overflow causing an error.  We try
-- to detect this and call 'error' before a segfault occurs.