{-# LANGUAGE OverloadedStrings #-}

{-|

An attempt to tokenize a Tweet. It is based on the twokenize.scala
code from 

https://github.com/brendano/ark-tweet-nlp/blob/master/src/edu/cmu/cs/lti/ark/tweetnlp/twokenize.scala

with changes to suit using Attoparsec rather than a regexp engine,
and a few extras driven by the #AAS219 data set. It is possible
that some of the tokenization - in particular the creation of the
"protected" tokens - could be cleaned up.

It is a shame that I threw away the positions of URLs and entities
calculated by Twitter as that would save some work here.

Issues (note that some of these have been addressed since the list was created):
 - partial fix for the "(foo" and "foo)" change
 - added … to the list of punctuation
 - added abbreviations for a.m. p.m. and a.k.a.


Lucky you ! MT @jeff_foust : Made it to Austin last night for #AAS219 for a few days . Sitting in on the ExoPAG ( exoplanet sci working group)
NASA to issue RFI in next few weeks to solicit ideas for probe - class ( < $1B) exoplanet missions ; start of long - term concept studies . #aas219
Bound for #AAS219 (@ Ronald Reagan Washington National Airport ( DCA ) w/ 42 others) http://t.co/cWL6yOPx
Lounging briefly before boarding my flight to Austin to attend #aas219 (@ United Club) http://t.co/IbTxWxzn
#AAPF12 #AAS219 Glikman - Dusty " red quasars " are best thing since sliced bread ( paraphrasing here) . Most luminous objects in universe .
education & public outreach w/ the @usvao virtual observ & microsoft ' s worldwide telescope . Rm19A ( Registration => UP escalator 2X) #aas219
New crowdvote contest for coolest result by grad student at AAS . ( self)Nominate : name , summary of result , #prize #aas219 . Beer/tacos prize .

note that group) not group ) and $1B) not $1B )
note that (@Ronald not ( @Ronald and others) not others )

growth * rate* of faculty jobs since 1998 similar to postdoc positions , but absolute numbers much larger so feels worse . #AAPF12 #aas219
Bummed I ' m not at #aas219 , but the office sure is quiet today . Here ' s hoping I ' ll get * TONS* of work done when I ' ve got the place to myself .
Wait , those two drink tickets are for * coffee* ?!? Remind me again what the huge registration fee is for ? #aas219

Hmm, not just a problem for ')', as 'rate*' above shows.

... (cont ' d ) Mon posters by D . Perrodin ( 146.22 ) and J . Simon ( 146.23 ) . Tue posters by D . Schmidt ( 237.16 ) and C . Gilpin ( 237.18 ) #aas219

The calm before a big NASA news week at #AAS219 ! (@ The Austin Convention Center/ Neal Kocurek Exhibit Hall) http://t.co/FWzwVlTz
Just heard about Armageddon screening w/commentary at the Ritz on Wed of #aas219 ! Can ' t beat Real Genius from aas211( ? ) but sounds fun !

Note the '&c' in the following:

Kaspar von Braun - GJ876 - refined star parameters - 2 planets ( b &c ) in habitable zone even assuming low heat distribution efficiency - #AAS219


RT @AAS_Press : NASA : Exoplanets , Supernovae , High - Energy Sky , New Images Among NASA News Highlights at AAS Meeting #aas219 . http :// t . co/ ...

If an http URL is not parsed then it gets split up into its constituents.


MT @usvao US VAO Exhibit at #aas219 - sharing & demonstrating new tools/services for data - intensive astronomy #rcuk #stfc #jiscmrd
Idea is to be similar to the Great Observatories program ( Hubble/Compton/Chandra/Spitzer ) ; diff capabilities to meet multiple needs . #aas219

should we split tools/services into two words?


@polarisdotca @kellecruz @derekbruff I think @doug_burke already started a #aas219 tracking page at the archivi st.

why is archivist split?


D ' Abrusco : investigating the provenance of a quasar SED he built from web services & home grown ASCII files at the @usvao workshop #aas219

any way to not split up D'Abrusco?

Jean - luc doumont on slides : have a title s that are sentences that no longer than a tweet #aas219

and presumably Jean-luc


At the #aas219 meeting in Austin this week ? Check out the following #fandm pulsar presentations : Mon a . m . talk by A . Lommen ( 108.03 ) ...
#aas219 is the pubic talk tomorrow night by weinberg actually public ? i . e . can i invite friends not registered for the meeting ?

a.m. kept together? i.e. too?

Looking forward to seeing @6thgradersrule at the K - 12 reception at #aas219 Sunday .
off to #aas219 to speak on cyber discovery & mining astronomical datastrea ms. .. a . k . a . tweets from neutron stars

Could group K-12 into a single token.
a.k.a. too?

@chrislintott Great idea , doing a paper . li for #aas219 ! Wish I was there in person ...

paper.li could be kept together


Web Cash , who has been pushing starshade tech development for exoplanet searches , says he ' s won funding to test it using a zeppelin… #aas219
…small starshade will be hung from Airship Ventures zeppelin , with groundbased telescope ; hope to be doing astronomy by summer . #aas219
Just arrived in Austin for #aas219 but now I need to eat ! No food on this dang flight… http://t.co/GKTtrKXE

should have the … split off from zeppelin and small

-}

module TwitterTokens (
  Protected(..)
  , fromProtected
  , simpleTokenize
  , tokenize
  , separateLine
  ) where

import qualified Data.Attoparsec.Text as A
import qualified Data.Text as T

import Control.Applicative (Applicative(..), (<$>), (<|>))

import Data.Char (isSpace, isAlphaNum, isDigit)
import Data.Maybe (fromMaybe)

{-
Separate out punctuation at the edges of words, so that 
'foo' goes to ' foo ' but don't remains as don't.

According to comments in twokenize.scala this is the major
source of problems.

8220/1 are smart quotes; 8216/7 are single quotes,
and 171/187 are the << >> characters.

TODO: use Data.Char.isPunctuation instead ?

-}
isEdgePunctChar :: Char -> Bool
isEdgePunctChar = (`elem` "'\"\8220\8221\8216\8217\171\187{}()[]*")

{-
The twokenize.scala version is [a-zA-Z0-9], so we include more
characters here.
-}
isNotEdgePunctChar :: Char -> Bool
isNotEdgePunctChar = isAlphaNum 

-- note: in the original regexp, '^' and '$' are presumably anchors
-- (start/end of input); treating them as literal characters here is
-- an approximation.
isOffEdgeChar :: Char -> Bool
isOffEdgeChar c = c `elem` "^$:;" || isSpace c

splitEdgePunct :: T.Text -> T.Text
splitEdgePunct = splitEdgePunctRight . splitEdgePunctLeft
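
-- An illustrative (untested) sketch of the intended behaviour: edge
-- punctuation is split off a word when an off-edge character (here a
-- space) sits on its other side, while internal apostrophes survive:
--
--   >>> splitEdgePunct " 'foo' don't (bar) "
--   " ' foo ' don't ( bar ) "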

-- This was written before I had decided to use Attoparsec,
-- so it could be re-written.
--
splitEdgePunctLeft :: T.Text -> T.Text
splitEdgePunctLeft = go
  where
    -- match offEdge
    go i = case T.uncons i of
      Nothing -> T.empty
      Just (c, is) -> T.cons c $ if isOffEdgeChar c then go1 is else go is

    -- although not in the original regexp, we need to support offEdge+
    go1 i = case T.uncons i of
      Nothing -> T.empty
      Just (c, is) -> T.cons c $ if isOffEdgeChar c
                                 then go1 is 
                                 else if isEdgePunctChar c
                                      then go2 is
                                      else go is

    -- match edgePunct+
    go2 i = case T.uncons i of
      Nothing -> T.empty
      Just (c, is) -> if isEdgePunctChar c
                      then T.cons c $ go2 is 
                      else let res = T.cons c $ go is
                           in if isNotEdgePunctChar c
                              then T.cons ' ' res
                              else res
      
splitEdgePunctRight :: T.Text -> T.Text
splitEdgePunctRight = go
  where
    -- match notEdgePunct
    go i = case T.uncons i of
      Nothing -> T.empty
      Just (c, is) -> T.cons c $ if isNotEdgePunctChar c
                                 then go1 is
                                 else go is
                                      
    -- match edgePunct+
    go1 i = case T.uncons i of
      Nothing -> T.empty
      Just (c, is) -> if isEdgePunctChar c 
                      then go2 (T.singleton c) is
                      else T.cons c $ go is
        
    go2 cs i = case T.uncons i of
      Nothing -> cs
      Just (c, is) -> if isEdgePunctChar c
                      then go2 (T.snoc cs c) is
                      else let res = T.append cs $ T.cons c $ go is
                           in if isOffEdgeChar c
                              then T.cons ' ' res
                              else res

-- excluding entity from the Scala version, as we assume we do not
-- have any
      
data Protected = 
  Heart T.Text
  | Arrow T.Text
  | Emoticon T.Text
  | Url T.Text
  | Email T.Text
  | TimeLike T.Text
  | Num T.Text
  | NumCommas T.Text
  | Punctuation T.Text
  | Abbreviation T.Text
  | Separator T.Text
  | Decorator T.Text
  | EmbeddedApostrophe T.Text
  | HashTag T.Text
  | AtMention T.Text
  deriving (Eq, Show)
  
fromProtected :: Protected -> T.Text           
fromProtected (Heart t) = t
fromProtected (Arrow t) = t
fromProtected (Emoticon t) = t
fromProtected (Url t) = t
fromProtected (Email t) = t
fromProtected (TimeLike t) = t
fromProtected (Num t) = t
fromProtected (NumCommas t) = t
fromProtected (Punctuation t) = t
fromProtected (Abbreviation t) = t
fromProtected (Separator t) = t
fromProtected (Decorator t) = t
fromProtected (EmbeddedApostrophe t) = t
fromProtected (HashTag t) = t
fromProtected (AtMention t) = t
           
protected :: A.Parser Protected
protected = 
  A.choice [heart, arrow, emoticon, url, email, timeLike, 
            numLike, punctuation, abbreviation, separators, 
            decorations, embeddedApostrophe, hashTag, atMention]

optional :: A.Parser a -> A.Parser (Maybe a)
optional p = (Just <$> p) <|> pure Nothing
      
{-             
coptional :: A.Parser Char -> A.Parser T.Text
coptional p = maybe T.empty T.singleton <$> optional p
-}

soptional :: A.Parser T.Text -> A.Parser T.Text
soptional p = fromMaybe T.empty <$> optional p

schar :: Char -> A.Parser T.Text             
schar c = A.char c >> return (T.singleton c)             

-- match <+/?3+
heart :: A.Parser Protected
heart = do
  l <- A.takeWhile1 (== '<')
  m <- optional $ A.char '/'
  r <- A.takeWhile1 (== '3')
  return . Heart . T.concat $ [l, maybe T.empty T.singleton m, r]
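
-- An illustrative (untested) example:
--
--   >>> A.parseOnly heart "</3"
--   Right (Heart "</3")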
      
-- match <*[-=]*>+
rarrow :: A.Parser T.Text
rarrow = do
  l <- A.takeWhile (== '<')
  m <- A.takeWhile (`elem` "-=")
  r <- A.takeWhile1 (== '>')
  return . T.concat $ [l, m, r]
    
-- match <+[-=]*>*         
larrow :: A.Parser T.Text
larrow = do
  l <- A.takeWhile1 (== '<')
  m <- A.takeWhile (`elem` "-=")
  r <- A.takeWhile (== '>')
  return . T.concat $ [l, m, r]
  
arrow :: A.Parser Protected
arrow = Arrow <$> (rarrow <|> larrow)

-- (?iu)[:=]
-- QUS: why the ignore case and unicode flags here; it's
-- not obvious they make a difference
normalEyes :: A.Parser T.Text
normalEyes = schar ':' <|> schar '='

-- [;]
wink :: A.Parser T.Text
wink = schar ';'

-- (|o|O|-|[^a-zA-Z0-9 ])
-- notes:
--   taken out optional part from this rule
--     as this is now handled by the calling rule
--   expanded the [^a-zA-Z0-9 ] rule since this
--     causes problems - e.g. :) should not match ) here
--
noseArea :: A.Parser T.Text
noseArea = 
  schar 'o' <|> schar 'O' <|> schar '-' <|> 
  (T.singleton <$> A.satisfy (A.notInClass " a-zA-Z0-9()[]/\\"))
 
-- [pP]
tongue :: A.Parser T.Text
tongue = schar 'p' <|> schar 'P'

-- [doO/\\]+
-- (remove the / if http://'s aren't cleaned)
otherMouths :: A.Parser T.Text
otherMouths =  A.takeWhile1 (`elem` "doO/\\")

-- [\(\[]+
sadMouths :: A.Parser T.Text
sadMouths = A.takeWhile1 (`elem` "([")
  
-- [D\)\]]+            
happyMouths :: A.Parser T.Text            
happyMouths = A.takeWhile1 (`elem` "D)]")
  
-- standard emoticon
semoticon :: A.Parser T.Text
semoticon = do
  l <- normalEyes <|> wink
  m <- soptional noseArea
  r <- tongue <|> otherMouths <|> sadMouths <|> happyMouths
  return . T.concat $ [l, m, r]

-- reversed emoticon
-- darn it; the regexp uses (?<=( |^)) to make sure
-- this pattern is either preceded by a space or is
-- at the start of a line (previous steps in the chain will
-- have added spaces before '(' characters when appropriate).
-- How do I handle this in Attoparsec, as we do not have state
-- to pass around?
--
-- So we could include the ' ' in the check here and handle ^ by
-- cheating and adding ' ' to the start of any string that
-- is processed. Actually, for now we exclude this check and
-- will worry about it later.
--
--  
remoticon :: A.Parser T.Text
remoticon = do
  -- A.char ' ' >> return ()
  l <- sadMouths <|> happyMouths <|> otherMouths
  m <- soptional noseArea
  r <- normalEyes <|> wink
  return . T.concat $ [l, m, r]

emoticon :: A.Parser Protected
emoticon = Emoticon <$> (semoticon <|> remoticon)
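
-- Illustrative (untested) examples of the standard and reversed forms:
--
--   >>> A.parseOnly emoticon ":-)"
--   Right (Emoticon ":-)")
--   >>> A.parseOnly emoticon "(-:"
--   Right (Emoticon "(-:")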

{-

-- support 0 to n applications of the parser
-- n >= 0 is required but not checked for
--
upto :: Int -> A.Parser a -> A.Parser [a]
upto = uptoAB 0
   
-- similar to upto, but requires at least 1 version     
upto1 :: Int -> A.Parser a -> A.Parser [a]
upto1 = uptoAB 1
-}
   
-- parse between n1 and n2 applications of the parser     
-- (inclusive); n1 <= n2 and n1 >= 0 but these constraints
-- are not enforced
--
-- not designed for efficiency
uptoAB :: Int -> Int -> A.Parser a -> A.Parser [a]
uptoAB n1 n2 p = 
  let opts = flip A.count p
      ns = [n2,n2-1..n1]
      ps = map opts ns
  in A.choice ps
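
-- An illustrative (untested) example: since A.choice tries the counts
-- in descending order, the longest match wins:
--
--   >>> A.parseOnly (uptoAB 1 3 A.digit) "12345"
--   Right "123"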

{- specialized version of upto which matches
up to N applications of the given character predicate and
returns a text value.
-}
{-
cupto :: Int -> (Char -> Bool) -> A.Parser T.Text
cupto = cuptoAB 0
-}

cuptoAB :: Int -> Int -> (Char -> Bool) -> A.Parser T.Text
cuptoAB n1 n2 f = do
  cs <- uptoAB n1 n2 (A.satisfy f)
  return . T.pack $ cs

{-
Since Twitter now re-writes URIs, the filter can be a lot simpler since
we are just looking for

  http://t.co/[a-zA-Z0-9]+
  https://t.co/[a-zA-Z0-9]+

although note that this can be truncated. Should we also be trying to
parse google.com or foo.bar/baz (i.e. without the leading http://)?

Currently unsupported:
  partial matches (where the URL has been clipped)
  URIs without a leading protocol
-}

url :: A.Parser Protected
url = do
  a <- A.stringCI "http://" <|> A.stringCI "https://"
  b <- A.stringCI "t.co/"
  c <- A.takeWhile1 (A.inClass "a-zA-Z0-9")
  return . Url $ T.concat [a,b,c]
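
-- An illustrative (untested) example, using a URL from the header:
--
--   >>> A.parseOnly url "http://t.co/cWL6yOPx"
--   Right (Url "http://t.co/cWL6yOPx")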

{-
  val Bound = """(\W|^|$)"""
  val Email = "(?<=" +Bound+ """)[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}(?=""" +Bound+")"

for now ignoring the pre/post check.
-}

email :: A.Parser Protected
email = do
  a <- A.takeWhile1 (A.inClass "-a-zA-Z0-9._%+-")
  b <- A.string "@"
  c <- A.many1 (T.snoc <$> A.takeWhile1 (A.inClass "-a-zA-Z0-9_") <*> A.char '.')
  d <- cuptoAB 2 4 (A.inClass "a-zA-Z")
  return . Email $ T.concat (a : b : c) `T.append` d
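
-- An illustrative (untested) example (the address is made up):
--
--   >>> A.parseOnly email "jane.doe@example.com"
--   Right (Email "jane.doe@example.com")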

-- \d+:\d+
timeLike :: A.Parser Protected
timeLike = do
  l <- A.takeWhile1 isDigit
  c <- A.char ':'
  r <- A.takeWhile1 isDigit
  return . TimeLike $ T.snoc l c `T.append` r
  
-- QUS: is it worth just using Attoparsec's number parsers?

-- \d+\.\d+
numNum :: A.Parser Protected    
numNum = do
  l <- A.takeWhile1 isDigit
  c <- A.char '.'
  r <- A.takeWhile1 isDigit
  return . Num $ T.snoc l c `T.append` r

{-
  (\d+,)+?\d{3}(?=([^,]|$))

going to drop the check on following characters
-}
numberWithCommas :: A.Parser Protected
numberWithCommas = do
  ts <- A.many1 (T.snoc <$> A.takeWhile1 isDigit <*> A.char ',')
  r <- A.count 3 (A.satisfy isDigit)
  return . NumCommas $ T.concat ts `T.append` T.pack r

numLike :: A.Parser Protected  
numLike = numNum <|> numberWithCommas
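
-- Illustrative (untested) examples for the time/number parsers:
--
--   >>> A.parseOnly timeLike "12:53"
--   Right (TimeLike "12:53")
--   >>> A.parseOnly numLike "1.0"
--   Right (Num "1.0")
--   >>> A.parseOnly numLike "10,000"
--   Right (NumCommas "10,000")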
  
{-          
  val punctChars = """['“\".?!,:;]"""
  val punctSeq   = punctChars+"""+"""

TODO: use Data.Char.isPunctuation instead?

-}
punctuation :: A.Parser Protected          
punctuation = Punctuation <$> A.takeWhile1 (A.inClass "'“\".?!,:;\8230") -- TODO: how about close smart quote?
          
{-
  val boundaryNotDot = """($|\s|[“\"?!,:;]|""" + entity + ")" 
  val aa1  = """([A-Za-z]\.){2,}(?=""" + boundaryNotDot + ")"
  val aa2  = """[^A-Za-z]([A-Za-z]\.){1,}[A-Za-z](?=""" + boundaryNotDot + ")"
  val standardAbbreviations = """\b([Mm]r|[Mm]rs|[Mm]s|[Dd]r|[Ss]r|[Jj]r|[Rr]ep|[Ss]en|[Ss]t)\."""
  val arbitraryAbbrev = "(" + aa1 +"|"+ aa2 + "|" + standardAbbreviations + ")"

Again, drop the check for following characters.
-}

{-
aa1, aa2 :: A.Parser T.Text
aa1 = 
  let isChar = A.inClass "A-Za-z."
      c2 = T.pack <$> A.count 2 (A.satisfy isChar)
  in T.append <$> c2 <*> A.takeWhile isChar

aa2 = 
  let isChar = A.inClass "A-Za-z."
  in do
    c1 <- A.satisfy (A.notInClass "A-Za-z")
    c2 <- A.takeWhile1 isChar
    c3 <- A.satisfy (A.inClass "A-Za-z")
    return $ T.cons c1 c2 `T.snoc` c3
-}
  
abrevD, abrevJ, abrevM, abrevR, abrevS :: A.Parser T.Text

abrevD = T.cons <$> A.satisfy (`elem` "Dd") <*> A.string "r"
abrevJ = T.cons <$> A.satisfy (`elem` "Jj") <*> A.string "r"

abrevM = do
  s <- A.satisfy (`elem` "Mm")
  e <- A.string "rs" <|> A.string "r" <|> A.string "s"
  return $ T.cons s e
  
abrevR = T.cons <$> A.satisfy (`elem` "Rr") <*> A.string "ep"

abrevS = do
  s <- A.satisfy (`elem` "Ss")
  e <- A.string "en" <|> A.string "r" <|> A.string "t"
  return $ T.cons s e
  
standardAbbreviations :: A.Parser T.Text
standardAbbreviations = do
  f <- A.choice [abrevM, abrevD, abrevS, abrevJ, abrevR]
  c <- A.char '.'
  return $ T.snoc f c

-- added e.g. a.m. a.k.a. p.m. w.r.t.
randomAbbreviations :: A.Parser T.Text  
randomAbbreviations = 
  A.stringCI "e.g." <|> A.stringCI "a.m." <|> A.stringCI "p.m." <|> A.stringCI "a.k.a."
  <|> A.stringCI "w.r.t."

{-
aa1 and aa2 seem to match too much, possibly because I have
not implemented the regexp semantics correctly.
-}
abbreviation :: A.Parser Protected              
-- abbreviation = Abbreviation <$> (aa1 <|> aa2 <|> standardAbbreviations)
abbreviation = Abbreviation <$> (randomAbbreviations <|> standardAbbreviations)
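
-- Illustrative (untested) examples:
--
--   >>> A.parseOnly abbreviation "a.m."
--   Right (Abbreviation "a.m.")
--   >>> A.parseOnly abbreviation "Dr."
--   Right (Abbreviation "Dr.")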
              
separators :: A.Parser Protected
separators = Separator <$> (A.takeWhile1 (== '-') <|> A.string "―")
  
decorations :: A.Parser Protected
decorations = Decorator <$> A.takeWhile1 (== '♫') 

{-
  val thingsThatSplitWords = """[^\s\.,]"""
  val embeddedApostrophe = thingsThatSplitWords+"""+'""" + thingsThatSplitWords + """+"""
-}

embeddedApostrophe :: A.Parser Protected
embeddedApostrophe =
  -- Note: unlike the backtracking regexp, takeWhile1 is greedy, so the
  -- apostrophe must be excluded from the "word" characters; otherwise
  -- the first takeWhile1 swallows it and the parser can never succeed.
  let sw = A.takeWhile1 (\c -> not (isSpace c) && c `notElem` ".,'")
  in do
    l <- sw
    c <- A.char '\''
    r <- sw
    return . EmbeddedApostrophe $ T.snoc l c `T.append` r
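
-- With the apostrophe excluded from sw, this should (untested) now
-- keep contractions together:
--
--   >>> A.parseOnly embeddedApostrophe "don't"
--   Right (EmbeddedApostrophe "don't")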
  
hashTag :: A.Parser Protected    
hashTag = do
  t <- T.cons <$> A.char '#' <*> A.takeWhile1 (A.inClass "a-zA-Z0-9_")
  return $ HashTag t
  
atMention :: A.Parser Protected
atMention = do
  t <- T.cons <$> A.char '@' <*> A.takeWhile1 (A.inClass "a-zA-Z0-9_")
  return $ AtMention t
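
-- Illustrative (untested) examples, using tokens from the header:
--
--   >>> A.parseOnly hashTag "#aas219"
--   Right (HashTag "#aas219")
--   >>> A.parseOnly atMention "@usvao"
--   Right (AtMention "@usvao")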
    
{-

  def allowEntities(pat: String)= {
    // so we can write patterns with < and > and let them match escaped html too
    pat.replace("<", "(<|&lt;)").replace(">", "(>|&gt;)")
  }
  
  val Hearts = allowEntities("""(<+/?3+)""")
  val Arrows = allowEntities("""(<*[-=]*>+|<+[-=]*>*)""")

  //  Emoticons
  val normalEyes = "(?iu)[:=]"
  val wink = "[;]"
  val noseArea = "(|o|O|-|[^a-zA-Z0-9 ])"
  val happyMouths = """[D\)\]]+"""
  val sadMouths = """[\(\[]+"""
  val tongue = "[pP]"
  val otherMouths = """[doO/\\]+""" // remove forward slash if http://'s aren't cleaned

  val emoticon = OR(
      // Standard version  :) :( :] :D :P
      OR(normalEyes, wink) + noseArea + OR(tongue, otherMouths, sadMouths, happyMouths),
      
      // reversed version (: D:  use positive lookbehind to remove "(word):"
      // because eyes on the right side is more ambiguous with the standard usage of : ;
      """(?<=( |^))""" + OR(sadMouths,happyMouths,otherMouths) + noseArea + OR(normalEyes, wink) 
      
      // TODO japanese-style emoticons
      // TODO should try a big precompiled lexicon from Wikipedia, Dan Ramage told me (BTO) he does this
  	)
  
val Protected  = new Regex(
    OR(
      Hearts,
      Arrows,
      emoticon,
      url,
      Email,
      entity,
      timeLike,
      numNum,
      numberWithCommas,
      punctSeq,
      arbitraryAbbrev,
      separators,
      decorations,
      embeddedApostrophe,
      Hashtag, 
      AtMention
     ))

-}

{-
Split the tweet up into lists of "bad" and "good" sections,
where the "good" sections can be further processed, and
the "bad" ones are left as is. Another way of thinking
about "bad" is that these are protected in some way;
e.g. URLs, the value 1.0, U.N.K.L.E., 12:53
-}

combineChars :: [Either Protected Char] -> [Either Protected T.Text]
combineChars = go []
  where
    -- We want to combine the characters into strings, as well as
    -- do some post-processing in order to fix up some issues
    -- with the parsing, namely
    --   '(foo' or 'foo)' -> '( foo' or 'foo )'
    --
    go [] [] = []
    go xs [] = [Right (T.pack (reverse xs))] 
    go [] (Left y:ys) = Left y : go [] ys
    go xs (Left y:ys) = Right (T.pack (reverse xs)) : Left y : go [] ys
    
    go xs (Right '(':ys) = go (' ':'(':' ':xs) ys
    go xs (Right ')':ys) = go (' ':')':' ':xs) ys
    
    go xs (Right c:ys) = go (c:xs) ys
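
-- An illustrative (untested) example, showing both the grouping of
-- characters into a single Text value and the extra spaces added
-- around parentheses (later split apart by splitWords):
--
--   >>> combineChars [Right 'h', Right 'i', Right '(', Left (Punctuation "!")]
--   [Right "hi ( ", Left (Punctuation "!")]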

splitWords :: [Either Protected T.Text] -> [Either Protected T.Text]
splitWords = go
  where
    go [] = []
    go (x@(Left _):xs) = x : go xs 
    go (Right x : xs) = map Right (T.words (T.strip x)) ++ go xs

simpleTokenize :: T.Text -> [Either Protected T.Text]
simpleTokenize ts =
  let sts = splitEdgePunct ts
      
      p = (Left <$> protected) <|> (Right <$> A.anyChar)
      
      -- res is Either String [Either Protected Char]
      -- and we need to combine the char's
      res = A.parseOnly (A.many1 p) $ T.cons ' ' $ T.snoc sts ' '
      
  in case res of
    Left emsg -> error emsg
    Right r -> splitWords $ combineChars r
    
{-
Return a version of the input line where each "token"
is space separated, so that

  "@foo: hi there(#bob)"

becomes

  "@foo : hi there ( #bob )"

-}

tokenize :: T.Text -> [T.Text]
tokenize = map (either fromProtected id) . simpleTokenize
           
separateLine :: T.Text -> T.Text
separateLine = T.unwords . tokenize
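
{-
An illustrative (untested) end-to-end run, using the example from the
comment above:

  >>> separateLine "@foo: hi there(#bob)"
  "@foo : hi there ( #bob )"
-}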