+-------- proposal for better buffer-switching commands:
+implement what VC++ currently has. you have a single "switch" command like
+CTRL-TAB, which as long as you hold the CTRL button down, brings successive
+buffers that are "next in line" into the current position, bumping the rest
+forward. once you release the CTRL key, the chain is broken, and further
+CTRL-TABs will start from the beginning again. this way, frequently used
+buffers naturally move toward the front of the chain, and you can switch
+back and forth between two buffers using CTRL-TAB. the only thing about
+CTRL-TAB is it's a bit awkward. the way to implement is to have
+modifier-up strokes fire off a hook, like modifier-up-hook. this is driven
+by event dispatch, so there are no synchronization issues. when C-tab is
+pressed, the binding function does something like set a one-shot handler on
+the modifier-up-hook (perhaps separate hooks for separate modifiers?).
+to do this, we'd also want to change the buffer tabs so that they maintain
+their own order. in particular, they start out synched to the regular
+order, but as you make changes, you don't want the tabs to change
+order. (in fact, they may already do this.) selecting a particular buffer
+from the buffer tabs DOES make the buffer go to the head of the line. the
+invariant is that if the tabs are displaying X items, those X items are the
+first X items in the standard buffer list, but may be in a different
+order. (it looks like the tabs may already implement all of this.)
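+a minimal sketch of the chain behavior described above, in C. everything here
+(struct buf_ring, ring_switch, ring_modifier_up, buffers as plain ints) is
+hypothetical and only models the ordering; it is not the actual buffer-list
+code. it also simplifies one thing: the list is reordered once, when the
+modifier is released, rather than on every press.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_BUFS 16

/* Hypothetical model: buffers are ints kept in most-recently-used order. */
struct buf_ring {
  int order[MAX_BUFS];  /* order[0] is the current buffer */
  int count;            /* number of valid entries in order[] */
  int cursor;           /* position reached in the current chain; 0 = no chain */
};

/* One CTRL-TAB press while the modifier is held: step to the buffer that is
   next in line, wrapping around.  Returns the buffer to display, or -1 if
   there are no buffers. */
int ring_switch(struct buf_ring *r)
{
  assert(r != NULL);
  if (r->count == 0)
    return -1;
  r->cursor = (r->cursor + 1) % r->count;
  return r->order[r->cursor];
}

/* What a one-shot handler on the (proposed) modifier-up hook would do:
   move the chosen buffer to the front, shift the buffers that were ahead
   of it back by one, and break the chain. */
void ring_modifier_up(struct buf_ring *r)
{
  int chosen;

  assert(r != NULL);
  if (r->cursor <= 0 || r->cursor >= r->count) {
    r->cursor = 0;
    return;
  }
  chosen = r->order[r->cursor];
  memmove(&r->order[1], &r->order[0],
          (size_t) r->cursor * sizeof r->order[0]);
  r->order[0] = chosen;
  r->cursor = 0;
}
```

+with this model, press-release-press-release toggles between the two most
+recent buffers, and holding the modifier across several presses reaches
+older ones.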
+- test all eol detection stuff under windows w/ and w/o mule, unix w/ and
+ w/o mule. (test configure flag, command-line flag, menu option) may need
+ a way of pretending to be unix under cygwin.
+- test under windows w/ and w/o mule, cygwin w/ and w/o mule, cygwin x
+ windows w/ and w/o mule.
+- test undecided-dos/unix/mac.
+- check ESC ESC works as isearch-quit under TTY's.
+- test coding-system-base and all its uses (grep for them).
+- menu item to revert to most recent auto save.
+- consider renaming build_string -> build_intstring and build_c_string to
+ build_string. (consistent with build_msg_string et al; many more
+ build_c_string than build_string)
+fixed problem causing crash due to invalid internal-format data, fixed an
+existing bug in valid_char_p, and added checks to more quickly catch when
+invalid chars are generated. still need to investigate why
+mswindows-multibyte is being detected.
+i now see why -- we only process 65536 bytes due to a constant
+MAX_BYTES_PROCESSED_FOR_DETECTION. instead, we should have no limit as
+long as we have a seekable stream. we also need to write
+stderr_out_lisp(), used in the debug info routines i wrote.
+check once more about DEBUG_XEMACS. i think debugging info should be
+ON by default. make sure it is. check that nothing untoward will result
+in a production system, e.g. presumably assert()s should not really abort().
+(!! Actually, this should be runtime settable! Use a variable for this, and
+it can be set using the same XEMACSDEBUG method. In fact, now that I think
+of it, I'm sure that debugging info should be on always, with runtime ways
+of turning on or off any funny behavior.)
+fixed various bugs preventing packages from being able to be built. still
+another bug, with psgml/etc/cdtd/docbook, which contains some strange
+characters starting around char pos 110,000. It gets detected as
+mswindows-multibyte (wrong! why?) and then invalid internal-format data is
+generated. need to fix mswindows-multibyte (and possibly add something
+that signals an error as well; need to work on this error-signalling
+mechanism) and figure out why it's getting detected as such. what i should
+do is add a debug var that outputs blow-by-blow info of the detection
+the stuff with global-window-system-map doesn't appear to work. in any
+case it needs better documentation. [DONE]
+M-home, M-end do work, but cause cl-macs to get loaded. why?
+finished the coding system changes and they finally work!
+need to implement undecided-unix/dos/mac. they should be easy to do; it
+should be enough to specify an eol-type but not do-eol, but check this.
+consider making the standard naming be foo-lf/crlf/cr, with unix/dos/mac as
+print methods for coding systems should include some of the generic
+properties. (also then fix print_..._within_print_method). [DONE]
+in a little while, go back and delete the text-file-wrapper-coding-system
+code. (it'll be in CVS if necessary to get at it.) [DONE]
+need to verify at some point that non-text-file coding systems work
+properly when specified. when gzip is working, this would be a good test
+case. (and consider creating base64 as well!)
+remove extra crap from coding-system-category that checks for chain coding
+perhaps make a primitive that gets at coding-system-canonical. [DONE]
+need to test cygwin, compiling the mule packages, get unix-eol stuff
+working. frank from germany says he doesn't see a lisp backtrace when he
+gets an error during temacs? verify that this actually gets outputted.
+consider putting the current language on the modeline, mousable so it can
+be switched. also consider making the coding system be mousable and the
+line number (pick a line) and the percentage (pick a percentage).
+added code so that debug_print() will output a newline to the mswindows
+debugging output, not just the console. need to test. [DONE]
+working on problem where all files are being detected as binary. the
+problem may be that the undecided coding system is getting wrapped with an
+auto-eol coding system, which it shouldn't be -- but even in this
+situation, we should get the right results! check the
+canonicalize-after-coding methods. also, determine_real_coding_system
+appears to be getting called even when we're not detecting encoding. also,
+undecided needs a print method to show its params, and chain needs to be
+updated to show canonicalize_after_coding. check others as well. [DONE]
+finished up coding system changes, testing.
+errors byte-compiling files in iso-2022-7-bit. perhaps it's not correctly
+noticed a problem in the dfc macros: we call
+get_coding_system_for_text_file with eol_wrap == 1, to allow for
+auto-detection of the eol type; but this defeats the check and
+short-circuit for unicode.
+still need to implement calling determine_real_coding_system() for
+non-seekable streams. to implement correctly, we need to do our own
+buffering. [DONE, BUT WITHOUT BUFFERING]
+implemented most stuff below.
+need to finish up changes to make_coding_system_1. (i changed the way
+internal coding systems were handled; i need to create subsidiaries for all
+types of coding systems, not just text ones.) there's a nasty xfree() crash
+i was hitting; perhaps it'll go away once all stuff has been rewritten.
+check under cygwin to make sure that when an error occurs during loadup, a
+as soon as andy releases his new setup, we should put it onto various
+standard windows software repositories.
+added global-tty-map and global-window-system-map. add some stuff to the
+maps, e.g. C-x ESC for repeat vs. C-x ESC ESC on TTY's, and of course ESC
+ESC on window systems vs. ESC ESC ESC on TTY's. [TEST]
+was working on integrating the two help-for-tutorial versions (mule,
+non-mule). [DONE, but test under non-Mule]
+was working on the file-coding changes. need to think more about
+text-file-wrapper. conclusion i think is that
+get_coding_system_for_text_file should wrap using a special coding system
+type called a text-file-wrapper, which inherits from chain, and implements
+canonicalize-after-decoding to just return the unwrapped coding system. We
+need to implement inheritance of coding systems, which will certainly come
+in extremely useful when coding systems get implemented in Lisp, which
+should happen at some point. (see existing docs about this.) essentially,
+we have a way of declaring that we inherit from some system, and the
+appropriate data structures get created, perhaps just an extra inheritance
+pointer. but when we create the coding system, the extra data needs to be
+a stretchy array of offsets, pointing to the type-specific data for the
+coding system type and all its parents. that means that in the methods
+structure for a coding system (which perhaps should be expanded beyond
+method, it's just a "class structure") is the index in these arrays of
+offsets. CODING_SYSTEM_DATA() can take any of the coding system classes
+(rename type to class!) that make up this class. similarly, a coding
+system class inherits its methods from the class above unless specifying
+its own method, and can call the superclass method at any point by either
+just invoking its name, or conceivably by some macro like
+CALL_SUPER (method, (args))
+similar mods would have to be made to coding stream structures.
+perhaps for the immediate we can just sort of fake things like we currently
+do with undecided calling some stuff from chain.
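+a rough sketch of the class-structure idea above, assuming a parent pointer
+in each class and a NULL method slot meaning "inherit". all names (struct
+cs_class, find_canonicalize_after_decoding, the three-argument CALL_SUPER,
+the integer result codes) are made up for illustration and do not match the
+real coding-system structures; the stretchy array of per-class data offsets
+is left out.

```c
#include <assert.h>
#include <stddef.h>

struct cs_class;
typedef int (*canon_fn)(const struct cs_class *self);

/* Hypothetical "class structure" for a coding system type. */
struct cs_class {
  const char *name;
  const struct cs_class *parent;         /* NULL at the root */
  canon_fn canonicalize_after_decoding;  /* NULL = inherit from parent */
};

/* Find the nearest class at or above C that defines the method. */
canon_fn find_canonicalize_after_decoding(const struct cs_class *c)
{
  for (; c != NULL; c = c->parent)
    if (c->canonicalize_after_decoding != NULL)
      return c->canonicalize_after_decoding;
  return NULL;
}

/* Hypothetical spelling of CALL_SUPER: start the lookup at the parent. */
#define CALL_SUPER(cls, method, args) (find_##method((cls)->parent) args)

/* Result codes, for illustration only: 1 = keep the chain, 2 = unwrapped. */
int chain_canon(const struct cs_class *self) { (void) self; return 1; }
int wrapper_canon(const struct cs_class *self) { (void) self; return 2; }

const struct cs_class chain_class = { "chain", NULL, chain_canon };
const struct cs_class text_file_wrapper_class =
  { "text-file-wrapper", &chain_class, wrapper_canon };
/* A subclass that defines nothing and so inherits chain's method. */
const struct cs_class plain_child_class =
  { "plain-child", &chain_class, NULL };

/* Ordinary dispatch: use the class's own method or the inherited one. */
int cs_canonicalize(const struct cs_class *c)
{
  canon_fn fn = find_canonicalize_after_decoding(c);
  assert(fn != NULL);
  return fn(c);
}

/* The wrapper explicitly invoking its superclass (chain) method. */
int wrapper_call_super(void)
{
  return CALL_SUPER(&text_file_wrapper_class, canonicalize_after_decoding,
                    (&text_file_wrapper_class));
}
```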
+need to implement support for iso-8859-15, i.e. iso-8859-1 + euro symbol.
+figure out how to fall back to iso-8859-1 as necessary.
+leave the current bindings the way they are for the moment, but bump off
+M-home and M-end (hardly used), and substitute my buffer movement stuff
+there's something to be said for combining block of 6 and paragraph,
+esp. if we make the definition of "paragraph" be so that it skips by 6 when
+eliminate advertised-undo crap, and similar hacks. [DONE]
+think about obsolete stuff to be eliminated. think about eliminating or
+dimming obsolete items from hyper-apropos and something similar in
+synched up the tutorials with FSF 21.0.105. was rewriting them to favor
+the cursor keys over the older C-p, etc. keys.
+Got thinking about key bindings again.
+(1) I think that M-up/down and M-C-up/down should be reversed. I use
+ scroll-up/down much more often than motion by paragraph.
+(2) Should we eliminate move by block (of 6) and substitute it for
+ paragraph? This would have the advantage that I could make bindings
+ for buffer change (forward/back buffer, perhaps M-C-up/down. with
+ shift, M-C-S-up/down only goes within the same type (C files, etc.).
+ alternatively, just bump off beginning-of-defun from C-M-home, since
+need someone to go over the other tutorials (five new ones, from FSF
+21.0.105) and fix them up to correspond to the english one.
+shouldn't shift-motion work with C-a and such as well as arrows?
+charcount_to_bytecount can also be made to scream -- as can scan_buffer,
+buffer_mule_signal_inserted_region, others? we should start profiling
+though before going too far down this line.
+Debug code that causes no slowdown should in general remain in the
+executable even in the release version because it may be useful (e.g. for
+people to see the event output). so DEBUG_XEMACS should be rethought.
+things like use of msvcrtd.dll should be controlled by error_checking on.
+maybe DEBUG_XEMACS controls general debug code (e.g. use of msvcrtd.dll,
+asserts abort, error checking), and the actual debugging code should remain
+always, or be conditionalized on something else.
+doc strings in dumped files are displayed with an extra blank line between
+each line. presumably this is recent? i assume either the change to
+detect-coding-region or the double-wrapping mentioned below.
+error with coding-system-property on iso-2022-jp-dos. problem is that that
+coding system is wrapped, so its type shows up as chain, not iso-2022.
+this is a general problem, and i think the way to fix it is to in essence
+do late canonicalization -- similar in spirit to what was done long ago,
+canonicalize_when_code, except that the new coding system (the wrapper) is
+created only once, either when the original cs is created or when first
+needed. this way, operations on the coding system work like expected, and
+you get the same results as currently when decoding/encoding. the only
+thing tricky is handling canonicalize-after-coding and the ever-tricky
+double-wrapping problem mentioned below. i think the proper solution is to
+move the autodetection of eol into the main autodetect type. it can be
+asked to autodetect eol, coding, or both. for just coding, it does like it
+currently does. for just eol, it does similar to what it currently does
+but runs the detection code that convert-eol currently does, and selects
+the appropriate convert-eol system. when it does both eol and coding, it
+does something on the order of creating two more autodetect coding systems,
+one for eol only and one for coding only, and chains them together. when
+each has detected the appropriate value, the results are combined. this
+automatically eliminates the double-wrapping problem, removes the need for
+complicated canonicalize-after-coding stuff in chain, and fixes the problem
+of autodetect not having a seekable stream because it's hidden inside of a
+chain. (we presume that in the both-eol-and-coding case, the various
+autodetect coding streams can communicate with each other appropriately.)
+also, we should solve the problem of internal coding systems floating
+around and clogging up the list simply by having an "internal" property on
+cs's and an internal param to coding-system-list (optional; if not given,
+you don't get the internal ones). [DONE]
+we should try to reduce the size of the from-unicode tables (the dominant
+memory hog in the tables). one obvious thing is to not store a whole
+emchar as the mapped-to value, but a short that encodes the octets. [DONE]
+need to merge up to latest in trunk.
+add unicode charsets for all non-translatable unicode chars; probably want
+to extend the concept of charsets to allow for dimension 3 and dimension 4
+charsets. for the moment we should stick with just dimension 3 charsets;
+otherwise we run past the current maximum of 4 bytes per emchar. (most code
+would work automatically since it uses MAX_EMCHAR_LEN; the trickiness is in
+certain code that has intimate knowledge of the representation.
+e.g. bufpos_to_bytind() has to multiply or divide by 1, 2, 3, or 4,
+and has special ways of handling each number. with 5 or 6 bytes per char,
+we'd have to change that code in various ways.) 96x96x96 = 884,736,
+so with two 96x96x96 charsets, we could tackle all Unicode values
+representable by UTF-16 and then some -- and only these codepoints will
+ever have assigned chars, as far as we know.
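+the arithmetic behind that claim, as a small C check. the helper names are
+made up; the only outside fact used is that UTF-16 can address U+0000
+through U+10FFFF, i.e. 0x110000 code points.

```c
#include <assert.h>

/* Number of positions in a charset with DIMS dimensions of PER_DIM
   positions each (e.g. 96 x 96 x 96 for a dimension-3 charset). */
unsigned long charset_capacity(unsigned long per_dim, int dims)
{
  unsigned long total = 1;
  int i;

  assert(dims >= 0);
  for (i = 0; i < dims; i++)
    total *= per_dim;
  return total;
}

/* How many charsets of the given capacity are needed to cover
   CODE_POINTS values (rounding up). */
unsigned long charsets_needed(unsigned long code_points,
                              unsigned long capacity)
{
  assert(capacity > 0);
  return (code_points + capacity - 1) / capacity;
}
```

+charset_capacity (96, 3) is 884,736, so two such charsets give 1,769,472
+positions against the 1,114,112 (0x110000) that UTF-16 can represent.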
+need an easy way of showing the current language environment. some menus
+need to have the current one checked or whatever. [DONE]
+implement unicode surrogates.
+implement buffer-file-coding-system-when-loaded -- make sure find-file,
+revert-file, etc. set the coding system [DONE]
+verify all the menu stuff [DONE]
+implemented the entirely-ascii check in buffers. not sure how much gain
+it'll get us as we already have a known range inside of which is constant
+time, and with pure-ascii files the known range spans the whole buffer.
+improved the comment about how bufpos-to-bytind and vice-versa work. [DONE]
+fix double-wrapping of convert-eol: when undecided converts itself to
+something with a non-autodetect eol, it needs to tell the adjacent
+convert-eol to reduce itself to nothing.
+need menu item for find file with specified encoding. [DONE]
+renamed coding systems mswindows-### to windows-### to follow the standard
+implemented coding-system-subsidiary-parent [DONE]
+HAVE_MULE -> MULE in files in nt/ so that depend checking works [DONE]
+need to take the smarter search-all-files-in-dir stuff from my sample init
+file and put it on the grep menu [DONE]
+added item for revert w/specified encoding; mostly works, but needs fixes.
+in particular, you get the correct results, but buffer-file-coding-system
+does not reflect things right. also, there are too many entries. need to
+split into submenus. there is already split code out there; see if it's
+generalized and if not make it so. it should only split when there's more
+than a specified number, and when splitting, split into groups of a
+specified size, not into a specified number of groups. [DONE]
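+the splitting rule described above (split only past a threshold, and into
+groups of a fixed size rather than a fixed number of groups) could look
+roughly like this. written in C for illustration only, with made-up helper
+names; the real menu code is Lisp.

```c
#include <assert.h>

/* Returns the number of submenus to create, or 0 if the menu should be
   left unsplit because it has no more than THRESHOLD entries. */
int split_group_count(int n_items, int threshold, int group_size)
{
  assert(group_size > 0);
  if (n_items <= threshold)
    return 0;
  return (n_items + group_size - 1) / group_size;
}

/* Half-open range [*start, *end) of item indices that go into submenu
   number GROUP (0-based).  The last group may be shorter than GROUP_SIZE. */
void split_group_bounds(int n_items, int group_size, int group,
                        int *start, int *end)
{
  assert(group_size > 0 && group >= 0);
  *start = group * group_size;
  *end = *start + group_size;
  if (*end > n_items)
    *end = n_items;
}
```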
+too many entries in the langenv menus; need to split. [DONE]
+NOTE: M-x grep for make-string causes crash now. something definitely to
+do with string changes. check very carefully the diffs and put in those
+sledgehammer checks. [DONE]
+fix font-lock bug i introduced. [DONE]
+added optimization to strings (keeps track of # of bytes of ascii at the
+beginning of a string). perhaps should also keep an all-ascii flag to deal
+with really large (> 2 MB) strings. rewrite code to count ascii-begin to
+use the 4-or-8-at-a-time stuff in bytecount_to_charcount.
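+a sketch of counting the ascii-begin bytes a word at a time, assuming
+8-byte words. the function name is hypothetical, and this is not the
+existing bytecount_to_charcount code, just the same trick of testing the
+high bit of every byte in a word at once.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return the number of leading bytes of S (length LEN) that are ASCII,
   i.e. have their high bit clear. */
size_t count_ascii_begin(const unsigned char *s, size_t len)
{
  const uint64_t high_bits = UINT64_C(0x8080808080808080);
  size_t i = 0;

  assert(s != NULL || len == 0);

  /* Fast path: skip 8 bytes at a time while none has its high bit set.
     memcpy() sidesteps alignment requirements on the word load. */
  while (i + sizeof(uint64_t) <= len) {
    uint64_t word;
    memcpy(&word, s + i, sizeof word);
    if (word & high_bits)
      break;
    i += sizeof word;
  }

  /* Slow path: finish the tail, or locate the exact non-ASCII byte inside
     the word that stopped the loop above. */
  while (i < len && s[i] < 0x80)
    i++;
  return i;
}
```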
+Error: M-q is causing Invalid Regexp error on the above paragraph. It's
+not working. I assume it's a side effect of the string stuff. VERIFY!
+Write sledgehammer checks for strings. [DONE]
+revamped the locale/init stuff so that it tries much harder to get things
+right. should test a bit more. in particular, test out Describe Language
+on the various created environments and make sure everything looks right.
+should change the menus: move the submenus on Edit->Mule directly under
+Edit. add a menu entry on File to say "Reload with specified encoding ->".
+Also "Find File with specified encoding ->". Also an entry to change the EOL
+settings for Unix, and implement it.
+decode-coding-region isn't working because it needs to insert a binary
+(char->byte) converter. [DONE]
+chain should be rearranged to be in decoding order; similar for
+source/sink-type, other things?
+the detector should check for a magic cookie even without a seekable input.
+(currently its input is not seekable, because it's hidden within a chain.
+#### See what we can do about this.)
+provide a way to display various settings, e.g. the current category
+mappings and priority (see mule-diag; get this working so it's in the
+path); also a way to print out the likeliness results from a detection,
+problem with `env', which causes path issues due to `env' in packages.
+move env code to process, sync with fsf 21.0.105, check that the autoloads
+in `env' don't cause problems. [DONE]
+8-bit iso2022 detection appears broken; or at least, mule-canna.c is not so
+something else to do is review the font selection and fix it so that (e.g.)
+JISX-0212 can be displayed.
+also, text in widgets needs to be drawn by us so that the correct fonts
+will be displayed even in multi-lingual text.
+the detection system is now properly abstracted. the detectors have been
+rewritten to include multiple levels of abstraction. now we just need
+detectors for ascii, binary, and latin-x, as well as more sophisticated
+detectors in general and further review of the general algorithm for doing
+detection. (#### Is this written up anywhere?) after that, consider adding
+error-checking to decoding (VERY IMPORTANT) and verifying the binary
+correctness of things under unix no-mule.
+began to fix the detection system -- adding multiple levels of likelihood
+and properly abstracting the detectors. the system is in place except for
+the abstraction of the detector-specific data out of the struct
+detection_state. we should get things working first before tackling that
+(which should not be too hard). i'm rewriting algorithms here rather than
+just converting code, so it's harder. mostly done with everything, but i
+need to review all detectors except iso2022 and make them properly follow
+the new way. also write a no-conversion detector. also need to look into
+the `recode' package and see how (if?) they handle detection, and maybe
+copy some of the algorithms. also look at recent FSF 21.0 and see if their
+algorithms have improved.
+fixed gc bugs from yesterday.
+close/finalize stuff works.
+eliminated notyet stuff in syswindows.h.
+eliminated special code in tstr_to_c_string.
+fixed pdump problems. (many of them, mostly latent bugs, ugh)
+fixed cygwin sscanf problems in parse-unicode-translation-table. (NOT a
+sscanf bug, but subtly different behavior w.r.t. whitespace in the format
+string, combined with a debugger that sucks ROCKS!! and consistently
+outputs garbage for variable values.)
+main stuff to test is the handling of EOF recognition vs. binary
+(i.e. check what the default settings are under Unix). then we may have
+something that WORKS on all platforms!!! (Also need to test Windows
+finished redoing the close/finalize stuff in the lstream code. but i
+encountered again the nasty bug mentioned on sep 15 that disappeared on its
+own then. the problem seems to be that the finalize method of some of the
+lstreams is calling Lstream_delete(), which calls free_managed_lcrecord(),
+which is a no-no when we're inside of garbage-collection and the object
+passed to free_managed_lcrecord() is unmarked, and about to be released by
+the gc mechanism -- the free lists will end up with xfree()d objects on
+them, which is very bad. we need to modify free_managed_lcrecord() to
+check if we're in gc and the object is unmarked, and ignore it rather than
+move it to the free list. [DONE]
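+a toy model of the check described above. the toy_* names, the struct
+layout and the global flag are all invented for illustration; the real
+free_managed_lcrecord() and lcrecord free lists look different.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for an lcrecord and its free list. */
struct toy_record {
  int marked;                    /* set by the GC mark phase */
  struct toy_record *next_free;  /* link used while on a free list */
};

struct toy_free_list {
  struct toy_record *head;
  int length;
};

/* Nonzero while garbage collection is running. */
int toy_gc_in_progress = 0;

/* Put REC on the free list for reuse, unless we are inside GC and REC is
   unmarked: in that case the sweeper is about to release it anyway, and
   queuing it would leave a freed object on the list.  Returns 1 if the
   record was queued, 0 if the request was ignored. */
int toy_free_managed(struct toy_free_list *fl, struct toy_record *rec)
{
  assert(fl != NULL && rec != NULL);
  if (toy_gc_in_progress && !rec->marked)
    return 0;
  rec->next_free = fl->head;
  fl->head = rec;
  fl->length++;
  return 1;
}
```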
+(#### What we really need to do is do what Java and C# do w.r.t. their
+finalize methods: For objects with finalizers, when they're about to be
+freed, leave them marked, run the finalizer, and set another bit on them
+indicating that the finalizer has run. Next GC cycle, the objects will
+again come up for freeing, and this time the sweeper notices that the
+finalize method has already been called, and frees them for good (provided
+that a finalize method didn't do something to make the object alive
+redid the lstream code so there is only one coding stream. combined the
+various doubled coding stream methods into one; i'm a little bit unsure of
+this last part, though, as the results of combining the two together seem
+unclean. got it to compile, but it crashes in loadup. need to go through
+and rehash the close vs. finalize stuff, as the problem was stuff getting
+freed too quickly, before the canonicalize-after-decoding was run. should
+eliminate entirely CODING_STATE_END and use a different method (close
+coding stream). rewrite to use these two. make sure they're called in the
+right places. Lstream_close on a stream should *NOT* do finalizing.
+finalize only on delete. [DONE]
+in general i'd like to see the flags eliminated and converted to
+bit-fields. also, rewriting the methods to take advantage of rejecting
+should make it possible to eliminate much of the state in the various
+methods, esp. including the flags. need to test this is working, though --
+reduce the buffer size down very low and try files with only CRLF's in
+them, with one offset by a byte from the other, and see if we correctly
+still have the problem with incorrectly truenaming files.
+bug reported: crash while closing lstreams.
+the lstream/coding system close code needs revamping. we need to document
+that order of closing lstreams is very important, and make sure we're
+consistent. furthermore, chain and undecided lstreams need to close their
+underneath lstreams when they receive the EOF signal (there may be data in
+the underneath streams waiting to come out), not when they themselves are
+(if only we had proper inheritance. i think in any case we should
+simulate it for the chain coding stream -- write things in such a way that
+undecided can use the chain coding stream and not have to duplicate
+in general we need to carefully think through the closing process to make
+sure everything always works correctly and in the right order. also check
+very carefully to make sure there are no dangling pointers to deleted
+objects floating around.
+move the docs for the lstream functions to the functions themselves, not
+the header files. document more carefully what exactly Lstream_delete()
+means and how it's used, what the connections are between Lstream_close(),
+Lstream_delete(), Lstream_flush(), lstream_finalize, etc. [DONE]
+additional error-checking: consider deadbeefing the memory in objects
+stored in lcrecord free lists; furthermore, consider whether lifo or fifo
+is correct; under error-checking, we should perhaps be doing fifo, and
+setting a minimum number of objects on the lists that's quite large so that
+it's highly likely that any erroneous accesses to freed objects will go
+into such deadbeefed memory and cause crashes. also, at the earliest
+available opportunity, go through all freed memory and check for any
+consistency failures (overwrites of the deadbeef), crashing if so. perhaps
+we could have some sort of id for each block, to easier trace where the
+offending block came from. (all of these ideas are present in the debug
+system malloc from VC++, plus more stuff.) there's similar code i wrote
+sitting somewhere (in free-hook.c? doesn't appear so. we need to delete the
+blocking stuff out of there!). also look into using the debug system
+malloc from VC++, which has lots of cool stuff in it. we even have the
+sources. that means compiling under pdump, which would be a good idea
+anyway. set it as the default. (but then, we need to remove the
+requirement that Xpm be a DLL, which is extremely annoying. look into
+test the windows code page coding systems recently created.
+problems reading my mail files -- 1personal appears to hang, others come up
+with lots of ^M's. investigate.
+test the enum functions i just wrote, and finish them.
+critical-quit broken sometime after aug 25.
+-- fixed process problems.
+-- print routines work. (no routine for ccl, though)
+-- can read and write unicode files, and they can still be read by some
+-- defaults should come up correctly -- mswindows-multibyte is general.
+still need to test matej's stuff.
+seems ok with multibyte stuff but needs more testing.
+!!!!! something broken with processes !!!!! cannot send mail anymore. must
+on mon/wed nights, stop *BEFORE* 11pm. Otherwise i just start getting
+woozy and can't concentrate.
+just finished getting assorted fixups to the main branch committed, so it
+will compile under C++ (Andy committed some code that broke C++ builds).
+cup'd the code into the fixtypes workspace, updated the tags appropriately.
+i've created the appropriate log message, sitting in fixtypes.txt in
+/src/xemacs; perhaps it should go into a README. now i just have to build
+on everything (it's currently building), verify it's ok, run patcher-mail,
+my mule ws is also very close. need to:
+-- test the new print routines.
+-- test it can read and write unicode files, and they can still be read by
+-- try to see if unicode can be auto-detected properly.
+-- test it can read and write multibyte files in a few different formats.
+ currently can't recognize them, but if you set the cs right, it should
+-- examine the test files sent by matej and see if we can handle them.
+more eol fixing. this stuff is utter crap.
+currently we wrap coding systems with convert-eol-autodetect when we create
+them in make_coding_system_1. i had a feeling that this would be a
+problem, and indeed it is -- when autodetecting with `undecided', for
+example, we end up with multiple layers of eol conversion. to avoid this,
+we need to do the eol wrapping *ONLY* when we actually retrieve a coding
+system in places such as insert-file-contents. these places are
+insert-file-contents, load, process input, call-process-internal,
+encode/decode/detect-coding-region, database input, ...
+(later) it's fixed, and things basically work. NOTE: for some reason,
+adding code to wrap coding systems with convert-eol-lf when eol-type == lf
+results in crashing during garbage collection in some pretty obscure place
+-- an lstream is free when it shouldn't be. this is a bad sign. i guess
+something might be getting initialized too early?
+we still need to fix the canonicalization-after-decoding code to avoid
+problems with coding systems like `internal-7' showing up. basically, when
+eol==lf is detected, nil should be returned, and the callers should handle
+it appropriately, eliding when necessary. chain needs to recognize when
+it's got only one (or even 0) items in the chain, and elide out the chain.
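+a sketch of the eliding step, assuming a chain is just an array of
+converter names in which NULL stands for "this step reduced itself to
+nothing" (e.g. an eol converter that detected lf). chain_elide and
+chain_resolve are hypothetical names, not functions in the source.

```c
#include <assert.h>
#include <stddef.h>

/* Drop the NULL entries from ITEMS in place, keeping the others in order.
   Returns the number of entries that survive. */
size_t chain_elide(const char **items, size_t n)
{
  size_t i, kept = 0;

  assert(items != NULL || n == 0);
  for (i = 0; i < n; i++)
    if (items[i] != NULL)
      items[kept++] = items[i];
  return kept;
}

/* Decide what the caller should use after eliding:
   - no items left: NULL, meaning no conversion is needed at all;
   - one item left: that item itself, with no chain around it;
   - otherwise: NULL is also returned and *STILL_CHAIN is set to 1, telling
     the caller to keep the (now shorter) chain. */
const char *chain_resolve(const char **items, size_t n, int *still_chain)
{
  size_t kept = chain_elide(items, n);

  assert(still_chain != NULL);
  *still_chain = (kept > 1);
  return kept == 1 ? items[0] : NULL;
}
```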
+sep 11, 2001: the day that will live in infamy.
+rewrite of sep 9 entry about formats:
+when calling make-coding-system, the name can be a cons of (format1 .
+format2), specifying that it decodes format1->format2 and encodes the other
+way. if only one name is given, that is assumed to be format1, and the
+other is either `external' or `internal' depending on the end type.
+normally the user when decoding gives the decoding order in formats, but
+can leave off the last one, `internal', which is assumed. a multichain
+might look like gzip|multibyte|unicode, using the coding systems named
+`gzip', `(unicode . multibyte)' and `unicode'. the way this actually works
+is by searching for gzip->multibyte; if not found, look for gzip->external
+or gzip->internal. (In general we automatically do conversion between
+internal and external as necessary: thus gzip|crlf does the expected, and
+maps to gzip->external, external->internal, crlf->internal, which when
+fully specified would be gzip|external:external|internal:crlf|internal --
+see below.) To forcibly fit together two converters that have explicitly
+specified and incompatible names (say you have unicode->multibyte and
+iso8859-1->ebcdic and you know that the multibyte and iso8859-1 in this
+case are compatible), you can force-cast using :, like this:
+ebcdic|iso8859-1:multibyte|unicode. (again, if you force-cast between
+internal and external formats, the conversion happens automatically.)
+moved the autodetection stuff (both codesys and eol) into particular coding
+systems -- `undecided' and `convert-eol' (type == `autodetect'). needs
+lots of work. still need to search through the rest of the code and find
+any remaining auto-detect code and move it into the undecided coding
+system. need to modify make-coding-system so that it spits out
+auto-detecting versions of all text-file coding systems unless we say not
+to. need to eliminate entirely the EOF flag from both the stream info and the
+coding system; have only the original-eof flag. in
+coding_system_from_mask, need to check that the returned value is not of
+type `undecided', falling back to no-conversion if so. also need to make
+sure we wrap everything appropriate for text-files -- i removed the
+wrapping on set-coding-category-list or whatever (need to check all those
+files to make sure all wrapping is removed). need to review carefully the
+new code in `undecided' to make sure it works and preserves the same logic
+as previously. need to review the closing and rewinding behavior of chain
+and undecided (same -- should really consolidate into helper routines, so
+that any coding system can embed a chain in it) -- make sure the dynarr's
+are getting their data flushed out as necessary, rewound/closed in the
+right order, no missing steps, etc.
+also split out mule stuff into mule-coding.c. work done on
+configure/xemacs.mak/Makefiles not done yet. work on emacs.c/symsinit.h to
+interface with the new init functions not done yet.
+also put in a few declarations of the way i think the abstracted detection
+stuff ought to go. DON'T WORK ON THIS MORE UNTIL THE REST IS DEALT WITH
+AND WE HAVE A WORKING XEMACS AGAIN WITH ALL EOL ISSUES NAILED.
+really need a version of cvs-mods that reports only the current directory.
+WRITE THIS! use it to implement a better cvs-checkin.
+implemented a gzip coding system. unfortunately, doesn't quite work right
+because it doesn't handle the gzip headers -- it just reads and writes raw
+zlib data. there's no function in the library to skip past the header, but
+we do have some code out of the library that we can snarf that implements
+header parsing. we need to snarf that, store it, and output it again at
+the beginning when encoding. in the process, we should create a "get next
+byte" macro that bails out when there are no more. using this, we set up a
+nice way of doing most stuff statelessly -- if we have to bail, we reject
+everything back to the sync point. also need to fix up the autodetection
+of zlib in configure.in.
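+a minimal sketch (Python, not the actual XEmacs C code) of the "get next
+byte" idea above: the header parser bails out when it runs out of bytes,
+and the caller rejects everything back to the sync point by reporting 0
+bytes consumed. the field layout follows the gzip format (RFC 1952); the
+names NeedMore, parse_gzip_header and try_header are hypothetical.

```python
import gzip
import zlib


class NeedMore(Exception):
    """Raised by next_byte() when the buffer runs dry."""


def parse_gzip_header(buf):
    """Return the header length in bytes, or raise NeedMore."""
    pos = 0

    def next_byte():
        nonlocal pos
        if pos >= len(buf):
            raise NeedMore
        b = buf[pos]
        pos += 1
        return b

    if next_byte() != 0x1F or next_byte() != 0x8B:
        raise ValueError("not a gzip stream")
    if next_byte() != 8:
        raise ValueError("unknown compression method")
    flags = next_byte()
    for _ in range(6):              # MTIME (4 bytes), XFL, OS
        next_byte()
    if flags & 0x04:                # FEXTRA: 2-byte little-endian length + data
        xlen = next_byte() | (next_byte() << 8)
        for _ in range(xlen):
            next_byte()
    if flags & 0x08:                # FNAME: zero-terminated
        while next_byte() != 0:
            pass
    if flags & 0x10:                # FCOMMENT: zero-terminated
        while next_byte() != 0:
            pass
    if flags & 0x02:                # FHCRC: 2 bytes
        next_byte()
        next_byte()
    return pos


def try_header(buf):
    """Stateless wrapper: on a short buffer, consume nothing (sync point)."""
    try:
        return parse_gzip_header(buf)
    except NeedMore:
        return 0


blob = gzip.compress(b"hello eol\r\n")
short = try_header(blob[:5])        # not enough data yet: everything rejected
n = try_header(blob)                # full header available
payload = zlib.decompressobj(-15).decompress(blob[n:])  # raw deflate after header
```

+because the parser keeps no state between calls, a rejected call costs only
+a re-parse of a few header bytes once more data arrives.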
+BIG problems with eol. finished up everything i thought i would need to
+get eol stuff working, but no -- when you have mswindows-unicode, with its
+eol set to autodetect, the detection routines themselves do the autodetect
+(first), and fail (they report CR on CRLF because of the NULL byte between
+the CR and the LF) since they're not looking at ascii data. with a chain
+it's similarly bad. for mswindows-multibyte, for example, which is a chain
+unicode->unicode-to-multibyte, autodetection happens inside of the chain,
+both when unicode and unicode-to-multibyte are active. we could twiddle
+around with the eol flags to try to deal with this, but it's gonna be a big
+mess, which is exactly what we're trying to avoid. what we basically want
+is to entirely rip out all EOL settings from either the coding system or
+the stream (yes, there are two! one might say autodetect, and then the
+stream contains the actual detected value). instead, we simply create an
+eol-autodetect coding system -- or rather, it's part of the convert-eol
+coding system. convert-eol, type = autodetect, does autodetection the
+first time it gets data sent to it to decode, and thereafter sets a stream
+parameter indicating the actual eol type for this stream. this means that
+all autodetect coding systems, as created by `make-coding-system', really
+are chains with a convert-eol at the beginning. only subsidiary xxx-unix
+has no wrapping at all. this should allow eol detection of gzip, unicode,
+etc. for that matter, general autodetection should be entirely
+encapsulated inside of the `autodetect' coding system, with no
+eol-autodetection -- the chain becomes convert-eol (autodetect) ->
+autodetect or perhaps backwards. the generic autodetect similarly has a
+coding-system in its stream methods, and needs somehow or other to insert
+the detected coding-system into the chain. either it contains a chain
+inside of it (perhaps it *IS* a chain), or there's some magic involving
+canonicalization-type switcherooing in the middle of a decode. either way,
+once everything is good and done and we want to save the coding system so
+it can be used later, we need to do another sort of canonicalization --
+converting auto-detect-type coding systems into the detected systems.
+again, a coding-system method, with some magic currently so that
+subsidiaries get properly used rather than something that's new but
+equivalent to subsidiaries. (#### perhaps we could use a hash table to
+avoid recreating coding systems when not necessary. but that would require
+that coding systems be immutable from external, and i'm not sure that's the
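+a small illustration (Python sketch, hypothetical names) of the
+mswindows-unicode failure described above: byte-level eol detection run on
+UTF-16 data sees a NUL between the CR and the LF and reports CR, while the
+same detection run after the unicode decoding step reports CRLF. this is
+the reason for doing the eol autodetection in a separate convert-eol step.

```python
def detect_eol(data: bytes) -> str:
    """Naive byte-level detection: classify the first line ending found."""
    for i, b in enumerate(data):
        if b == 0x0A:
            return "lf"
        if b == 0x0D:
            if i + 1 < len(data) and data[i + 1] == 0x0A:
                return "crlf"
            return "cr"
    return "unknown"


text = "one\r\ntwo\r\n"
raw = text.encode("utf-16-le")            # CR 00 LF 00 on the wire

# Detection on the external bytes, i.e. before the unicode decoder has run.
raw_guess = detect_eol(raw)

# Detection on the decoded characters, i.e. after the unicode decoder.
decoded_guess = detect_eol(raw.decode("utf-16-le").encode("ascii"))
```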
+i really think, after all, that i should reverse the naming of everything
+in chain and source-sink-type -- they should be decoding-centric. later
+on, if/when we come up with the proper way to make it totally symmetrical,
+we'll be fine whether before then we were encoding or decoding centric.
+investigated eol parameter.
+implemented handling in make-coding-system of eol-cr and eol-crlf.
+fixed calls everywhere to Fget_coding_system / Ffind_coding_system to
+reject non-char->byte coding systems.
+still need to handle "query eol type using coding-system-property" so it
+magically returns the right type by parsing the chain.
+no work done on formats, as mentioned below. we should consider using :
+instead of || to indicate casting.
+renamed some codesys properties: `list' in chain -> chain; `subtype' in
+unicode -> type. everything compiles again and sort of works; some CRLF
+problems that may resolve themselves when i finish the convert-eol stuff.
+the stuff to create subsidiaries has been rewritten to use chains; but i
+still need to investigate how the EOL type parameter is used. also, still
+need to implement this: when a coding system is created, and its eol type
+is not autodetect or lf, a chain needs to be created and returned. i think
+that what needs to happen is that the eol type can only be set to
+autodetect or lf; later on this should be changed to simply be either
+autodetect or not (but that would require ripping out the eol converting
+stuff in the various coding systems), and eventually we will do the work on
+the detection mechanism so it can do chain detection; then we won't need an
+eol autodetect setting at all. i think there's a way to query the eol type
+of a coding system; this should check to see if the coding system is a
+chain and there's a convert-eol at the front; if so, the eol type comes
+from the type of the convert-eol.
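+a rough sketch (Python, not the real C structures) of the eol-type query
+described above. assumptions: the class CodingSystem and its fields are
+hypothetical, "front" is taken to mean the first element of the chain list,
+and anything that is not such a chain is reported as lf.

```python
from dataclasses import dataclass, field


@dataclass
class CodingSystem:
    name: str
    type: str                      # e.g. "chain", "convert-eol", "iso2022"
    eol_subtype: str = "lf"        # only meaningful when type == "convert-eol"
    chain: list = field(default_factory=list)


def eol_type(cs: CodingSystem) -> str:
    """If cs is a chain fronted by convert-eol, its eol type is that subtype."""
    if cs.type == "chain" and cs.chain and cs.chain[0].type == "convert-eol":
        return cs.chain[0].eol_subtype
    return "lf"


base = CodingSystem("x-unix", "iso2022")
dos = CodingSystem(
    "x-dos",
    "chain",
    chain=[CodingSystem("convert-eol", "convert-eol", eol_subtype="crlf"), base],
)
```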
+also check out everywhere that Fget_coding_system or Ffind_coding_system is
+called, and see whether anything but a char->byte system can be tolerated.
+create a new function for all the places that only want char->byte,
+something like get_coding_system_char_to_byte_only.
+think about specifying formats in make-coding-system. perhaps the name can
+be a cons of (format1, format2), specifying that it encodes
+format1->format2 and decodes the other way. if only one name is given,
+that is assumed to be format2, and the other is either `byte' or `char'
+depending on the end type. normally the user when decoding gives the
+decoding order in formats, but can leave off the last one, `char', which is
+assumed. perhaps we should say `internal' instead of `char' and `external'
+instead of byte. a multichain might look like gzip|multibyte|unicode,
+using the coding systems named `gzip', `(unicode . multibyte)' and
+`unicode'. we would have to allow something where one format is given only
+as generic byte/char or internal/external to fit with any of the same
+byte/char type. when forcibly fitting together two converters that have
+explicitly specified and incompatible names (say you have
+unicode->multibyte and iso8859-1->ebcdic and you know that the multibyte
+and iso8859-1 in this case are compatible), you can force-cast using ||,
+like this: ebcdic|iso8859-1||multibyte|unicode. this will also force
+external->internal translation as necessary:
+unicode|multibyte||crlf|internal does unicode->multibyte,
+external->internal, crlf->internal. perhaps you'd need to put in the
+internal translation, like this: unicode|multibyte|internal||crlf|internal,
+which means unicode->multibyte, external->internal (multibyte is compatible
+with external); force-cast to crlf format and convert crlf->internal.
+even later: Sep 8, 2001:
+chain doesn't need to set character mode, that happens automatically when
+the coding systems are created. fixed chain to return correct source/sink
+type for itself and to check the compatibility of source/sink types in its
+chain. fixed decode/encode-coding-region to check the source and sink
+types of the coding system performing the conversion and insert appropriate
+byte->char/char->byte converters (aka "binary" coding system). fixed
+set-coding-category-system to only accept the traditional
+encode-char-to-byte types of coding systems.
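+a minimal sketch (Python, hypothetical names) of the two checks just
+described: a chain verifies that each sink type matches the next source
+type and reports its own end types, and the region functions pad the ends
+with "binary" converters until both ends are character. the tuples and the
+name "binary-reversed" are assumptions for illustration only.

```python
def chain_ends(chain):
    """chain: list of (name, source, sink) tuples in encoding order.

    Returns the (source, sink) type of the chain as a whole and raises
    ValueError if two neighbours do not fit together.
    """
    for (lname, _, lsink), (rname, rsource, _) in zip(chain, chain[1:]):
        if lsink != rsource:
            raise ValueError(
                f"{lname} produces {lsink} but {rname} expects {rsource}"
            )
    return chain[0][1], chain[-1][2]


def pad_to_char_ends(chain):
    """Add char->byte / byte->char converters so both ends are char."""
    source, sink = chain_ends(chain)
    padded = list(chain)
    if source == "byte":
        padded.insert(0, ("binary", "char", "byte"))
    if sink == "byte":
        padded.append(("binary-reversed", "byte", "char"))
    return padded


unicode_cs = ("unicode", "char", "byte")
gzip_cs = ("gzip", "byte", "byte")
padded = pad_to_char_ends([unicode_cs, gzip_cs])
```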
+still need to extend chain to specify the parameters mentioned below,
+esp. "reverse". also need to extend the print mechanism for chain so it
+prints out the chain. probably this should be general: have a new method
+to return all properties, and output those properties. you could also
+implement a read syntax for coding systems this way.
+still need to implement convert-eol and finish up the rest of the eol stuff
+later September 7, 2001: (more like Sep 8)
+moved many Lisp_Coding_System * params to Lisp_Object. In general this is
+the way to go, and if we ever implement a copying GC, we will never want to
+be passing direct pointers around. With no error-checking, we lose no
+cycles using Lisp_Objects in place of pointers -- the Lisp_Object itself is
+nothing but a pointer, and so all the casts and "dereferences" boil down to
+Clarified and cleaned up the "character mode" on streams, and documented
+who (caller or object itself) has the right to be setting character mode on
+a stream, depending on whether it's a read or write stream. changed
+conversion_end_type method and enum source_sink_type to return
+encoding-centric values, rather than decoding-centric. for the moment,
+we're going to be entirely encoding-centric in everything; we can rethink
+later. fixed coding systems so that the decode and encode methods are
+guaranteed to receive only full characters, if that's the source type of
+the data, as per conversion_end_type.
+still need to fix the chain method so that it correctly sets the character
+mode on all the lstreams in it and checks the source/sink types to be
+compatible. also fix decode-coding-string and friends to put the
+appropriate byte->character (i.e. no-conversion) coding systems on the ends
+as necessary so that the final ends are both character. also add to chain
+a parameter giving the ability to switch the direction of conversion of any
+particular item in the chain (i.e. swap encoding and decoding). i think
+what we really want to do is allow for arbitrary parameters to be put onto
+a particular coding system in the chain, of which the only one so far is
+swap-encode-decode. don't need too much codage here for that, but make the
+just added a return value from the decode and encode methods of a coding
+system, so that some of the data can get rejected. fixed the calling
+routines to handle this. need to investigate when and whether the coding
+lstream is set to character mode, so that the decode/encode methods only
+get whole characters. if not, we should do so, according to the source
+type of these methods. also need to implement the convert_eol coding
+system, and fix the subsidiary coding systems (and in general, any coding
+system where the eol type is specified and is not LF) to be chains
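+a sketch (Python, hypothetical names, UTF-8 chosen only as an example) of
+what the new return value buys: the decode method reports how many bytes
+it actually consumed, and the calling routine carries the rejected tail
+over to the next call, so the method only ever converts whole characters.

```python
def decode_utf8_prefix(data: bytes):
    """Decode the longest prefix that ends on a character boundary.

    Returns (text, bytes_consumed). At most 3 trailing bytes can belong to
    an incomplete UTF-8 character, so only those cut points are tried.
    """
    for cut in range(len(data), max(len(data) - 4, -1), -1):
        try:
            return data[:cut].decode("utf-8"), cut
        except UnicodeDecodeError:
            continue
    raise ValueError("invalid UTF-8 data")


def feed_all(chunks):
    """Calling routine: re-present rejected bytes together with the next chunk."""
    pending = b""
    out = []
    for chunk in chunks:
        buf = pending + chunk
        text, used = decode_utf8_prefix(buf)
        out.append(text)
        pending = buf[used:]        # rejected data, kept for the next round
    return "".join(out), pending


data = "día €".encode("utf-8")
result, leftover = feed_all([data[:6], data[6:]])   # split inside the euro sign
```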
+after everything is working, need to remove eol handling from encode/decode
+methods and eventually consider rewriting (simplifying) them given the
+-- need to organize this. get everything below into the TODO list.
+ CVS the TODO list frequently so i can delete old stuff. prioritize
+-- move README.ben-mule... to STATUS.ben-mule...; use README for
+ intro, overview of what's new, what's broken, how to use the
+-- need a global and local coding-category-precedence list, which get
+-- finished the BOM support. also finished something not listed
+ below, expansion to the auto-generator of Unicode-encapsulation to
+ support bracketing code with #if ... #endif, for Cygwin and MINGW
+ problems, e.g. This is tested; appears to work.
+-- need to add more multibyte coding systems now that we have various
+ properties to specify them. need to add DEFUN's for mac-code-page
+ and ebcdic-code-page for completeness. need to rethink the whole
+ way that the priority list works. it will continue to be total
+ junk until multiple levels of likeliness get implemented.
+-- need to finish up the stuff about the various defaults. [need to
+ investigate more generally where all the different default values
+ are that control encoding. (there are six places or so.) need to
+ list them in make-coding-system docs and put pointers
+ elsewhere. [[[[#### what interface to specify that this default
+ should be unicode? a "Unicode" language environment seems too
+ drastic, as the language environment controls much more.]]]] even
+ skipping the Unicode stuff here, we need to survey and list the
+ variables that control coding page behavior and determine how they
+ need to be set for various possible scenarios:
+ -- total binary: no detection at all.
+ -- raw-text only: wants only autodetection of line endings, nothing else.
+ -- "standard Windows environment": tries for Unicode, falls back on
+ -- some sort of East European environment, and Russian.
+ -- some sort of standard Japanese Windows environment.
+ -- standard Chinese Windows environments (traditional and simplified)
+ -- various Unix environments (European, Japanese, Russian, etc.)
+ -- Unicode support in all of these when it's reasonable
+These really require multiple likelihood levels to be fully
+implementable. We should see what can be done ("gracefully fall
+back") with single likelihood level. need lots of testing.
+-- need to fix the truename problem.
+-- lots of testing: need to test all of the stuff above and below that's
+ recently been implemented.
+mostly everything compiles. currently there is a crash in
+parse-unicode-translation-table, and Cygwin/Mule won't run. it may
+well be a bug in the sscanf() in Cygwin.
+-- adding BOM support for Unicode coding systems. mostly there, but
+ need to finish adding BOM support to the detection routines. then test.
+-- adding properties to unicode-to-multibyte to specify the coding
+ system in various flexible ways, e.g. directly specified code page
+ or ansi or oem code page of specified locale, current locale,
+ user-default or system-default locale. need to test.
+-- creating a `multibyte' coding system, with the same parameters as
+ unicode-to-multibyte and which resolves at coding-system-creation
+ time to the appropriate chain. creating the underlying mechanism
+ to allow such under-the-scenes switcheroo. need to test.
+-- set default-value of buffer-file-coding-system to
+ mswindows-multibyte, as Matej said it should be. need to test.
+ need to investigate more generally where all the different default
+ values are that control encoding. (there are six places or so.)
+ need to list them in make-coding-system docs and put pointers
+ elsewhere. #### what interface to specify that this default should
+ be unicode? a "Unicode" language environment seems too drastic, as
+ the language environment controls much more.
+-- thinking about adding multiple levels of certainty to the detection
+ schemes, instead of just a mask. eventually, we need to totally
+ abstract things, but that can more easily be done in many steps. (we
+ need multiple levels of likelihood to more reasonably support a
+ Windows environment with code-page type files. currently, in order
+ to get them detected, we have to put them first, because they can
+ look like lots of other things; but then, other encodings don't get
+ detected. with multiple levels of likelihood, we still put the
+ code-page categories first, but they will return low levels of
+ likelihood. Lower-down encodings may be able to return higher
+ levels of likelihood, and will get taken preferentially.)
+-- making it so you cannot disable file-coding, but you get an
+ equivalent default on Unix non-Mule systems where all defaults are
+ `binary'. need to test!!!!!!!!!
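+a small sketch (Python, hypothetical names and scores) of the "multiple
+levels of certainty" item above: detectors are tried in priority order and
+each returns a likelihood level instead of a yes/no mask bit. the
+code-page category stays first but only ever claims low likelihood, so a
+lower-down encoding with a higher level is taken preferentially, while ties
+still go to the earlier category.

```python
def code_page(data: bytes) -> int:
    """Any byte string can be code-page text, so claim only low likelihood."""
    return 1


def utf_8(data: bytes) -> int:
    """High likelihood only for valid UTF-8 that contains non-ASCII bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return 0
    return 3 if any(b >= 0x80 for b in data) else 1


def pick_category(data, detectors):
    """detectors: list of (name, function) pairs in priority order."""
    best_name, best_score = None, 0
    for name, detect in detectors:
        score = detect(data)
        if score > best_score:      # strict '>' keeps the earlier one on ties
            best_name, best_score = name, score
    return best_name


detectors = [("code-page", code_page), ("utf-8", utf_8)]
utf8_pick = pick_category("naïve".encode("utf-8"), detectors)
latin_pick = pick_category(b"\xe9t\xe9", detectors)
ascii_pick = pick_category(b"plain", detectors)
```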
+Matej (mostly, + some others) notes the following problems, and here
+-- he wants the defaults to work right. [figure out what those
+ defaults are. i presume they are auto-detection of data in current
+ code page and in unicode, and new files have current code page set
+ as their output encoding.]
+-- too easy to lose data with incorrect encodings. [need to set up an
+ error system for encoding/decoding. extremely important but a
+ little tricky to implement so let's deal with other issues now.]
+-- EOL isn't always detected correctly. [#### ?? need examples]
+-- truename isn't working: c:\t.txt and c:\tmp.txt have the same truename.
+ [should be easy to fix]
+-- unicode files lose the BOM mark. [working on this]
+-- command-line utilities use OEM. [actually it seems more
+ complicated. it seems they use the codepage of the console. we
+ may be able to set that, e.g. to UTF8, before we invoke a command.]
+-- no way to handle unicode characters not recognized as charsets. [we
+ need to create something like 8 private 2-dimensional charsets to
+ handle all BMP Unicode chars. Obviously this is a stopgap
+ solution. Switching to Unicode internal will ultimately make life
+ far easier and remove the BMP limitation. but for now it will
+ work. we translate all characters where we have charsets into
+ chars in those charsets, and the remainder in a unicode charset.
+ that way we can save them out again and guarantee no data loss with
+ unicode. this creates font problems, though ...]
+-- problems with xemacs font handling. [xemacs font handling is not
+ sophisticated enough. it goes on a charset granularity basis and
+ only looks for a font whose name contains the corresponding windows
+ charset in it. with unicode this fails in various ways. for one
+ the granularity needs to be single character, so that those unicode
+ charsets mentioned above work; and it needs to query the font to
+ see what unicode ranges it supports, rather than just looking at
+working on getting everything to compile again: Cygwin, non-MULE,
+mswindows-multibyte is now defined using chain, and works. removed
+most vestiges of the mswindows-multibyte coding system type.
+file-coding is on by default; should default to binary only on Unix.
+Need to test. (Needs to compile first :-)
+I've fixed the issue of inputting non-ASCII text under -nuni, and done
+some of the work on the Russian C-x problem -- we now compute the
+other possibilities. We still need to fix the key-lookup code,
+though, and that code is unfortunately a bit ugly. the best way, it
+seems, is to expand the command-builder structure so you can specify
+different interpretations for keys. (if we do find an alternative
+binding, though, we need to mess with both the command builder and
+this-command-keys, as does the function-key stuff. probably need to
+abstract that munging code.)
+-- support for WM_IME_CHAR. IME input can work under -nuni if we use
+ WM_IME_CHAR. probably we should always be using this, instead of
+ snarfing input using WM_COMPOSITION. i'll check this out.
+-- Russian C-x problem. see above.
+-- make sure it compiles and runs under non-mule. remember that some
+ code needs the unicode support, or at least a simple version of it.
+-- make sure it compiles and runs under pdump. see below.
+-- clean up mswindows-multibyte, TSTR_TO_C_STRING. see below. [DONE]
+-- eliminate last vestiges of codepage<->charset conversion and similar stuff.
+-- cut and paste. see below.
+-- misc issues with handling lang environments. see also August 25,
+ "finally: working on the C-x in ...".
+ -- when switching lang env, needs to set keyboard layout.
+ -- user var to control whether, when moving into text of a
+ particular language, we set the appropriate keyboard layout. we
+ would need to have a lisp api for retrieving and setting the
+ keyboard layout, set text properties to indicate the layout of
+ text, and have a way of dealing with text with no property on
+ it. (e.g. saved text has no text properties on it.) basically,
+ we need to get a keyboard layout from a charset; getting a
+ language would do. Perhaps we need a table that maps charsets
+ to language environments.
+ -- test that the lang env is properly set at startup. test that
+ switching the lang env properly sets the C locale (call
+ setlocale(), set LANG, etc.) -- a spawned subprogram should have
+ the new locale in its environment.
+-- look through everything below and see if anything is missed in this
+ priority list, and if so add it. create a separate file for the
+ priority list, so it can be updated as appropriate.
+-- clean up the chain coding system. its list should specify decode
+ order, not encode; i now think this way is more logical. it should
+ check the endpoints to make sure they make sense. it should also
+ allow for the specification of "reverse-direction coding systems":
+ use the specified coding system, but invert the sense of decode and
+-- along with that, places that take an arbitrary coding system and
+ expect the ends to be anything specific need to check this, and add
+ the appropriate conversions from byte->char or char->byte.
+-- get some support for arabic, thai, vietnamese, japanese jisx 0212:
+ at least get the unicode information in place and make sure we have
+ things tied together so that we can display them. worry about r2l
+There is actually more non-Unicode-ized stuff, but it's basically
+inconsequential. (See previous note.) You can check using the file
+nmkun.txt (#### RENAME), which is just a list of all the routines that
+have been split. (It was generated from the output of `nmake
+unicode-encapsulate', after removing everything from the output but
+the function names.) Use something like
+fgrep -f ../nmkun.txt -w [a-hj-z]*.[ch] |m
+in the source directory, which does a word match and skips
+intl-unicode-win32.[ch] and intl-win32.[ch], which have a whole lot of
+references to these, unavoidably. It effectively detects what needs
+to be changed because changed versions either begin qxe... or end with
+A or W, and in each case there's no whole-word match.
+The nasty bug has been fixed below. The -nuni option now works -- all
+specially-written code to handle the encapsulation has been tested by
+some operation (fonts by loadup and checking the output of (list-fonts
+""); devmode by printing; dragdrop tests other stuff).
+NOTE: for -nuni (Win 95), areas need work:
+-- cut and paste. we should be able to receive Unicode text if it's
+ there, and we should be able to receive it even in Win 95 or -nuni.
+ we should just check in all circumstances. also, under 95, when we
+ put some text in the clipboard, it may or may not also be
+ automatically enumerated as unicode. we need to test this out
+ and/or just go ahead and manually do the unicode enumeration.
+-- receiving keyboard input. we get only a single byte, but we should
+ be able to correlate the language of the keyboard layout to a
+ particular code page, so we can then decode it correctly.
+-- mswindows-multibyte. still implemented as its own thing. should
+ be done as a chain of (encoding) unicode | unicode-to-multibyte.
+ need to turn this on, get it working, and look into optimizations
+ in the dfc stuff. (#### perhaps there's a general way to do these
+ optimizations??? something like having a method on a coding system
+ that can specify whether a pure-ASCII string gets rendered as
+ pure-ASCII bytes and vice-versa.)
+-- we have special macros TSTR_TO_C_STRING and such because formerly
+ the DFC macros didn't know about external stuff that was Unicode
+ encoded and would call strlen() on them. this is fixed, so now we
+ should undo the special macros, make em normal, remove the
+ comments about this, and make sure it works. [DONE]
+-- finally: working on the C-x in Russian key layout problem. in the
+ process will probably end up doing work on cleaning up the handling
+ of keyboard layouts, integrating or deleting the FSF stuff, adding
+ code to change the keyboard layout as we move in and out of text in
+ different languages (implemented as a post-command-hook; we need
+ something like internal-post-command-hook if not already there, for
+ internal stuff that doesn't want to get mixed up with the regular
+ post-command-hook; similar for pre-command-hook). also, when
+ langenv changes, ways to set the keyboard layout appropriately.
+-- i think the stuff above is higher priority than the other stuff
+ mentioned below. what i'm aiming for is to be able to input and
+ work with multiple languages without weird glitches, both under 95
+ and NT. the problems above are all basic impediments to such work.
+ we assume for the moment that the user can make use of the existing
+ file i/o conversion stuff, and put that lower in priority, after
+ the basic input is working.
+-- i should get my modem connected and write up what's going on and
+ send it to the lists; also cvs commit my workspaces and get more