oct 27, 2001:
-------- proposal for better buffer-switching commands:
implement what VC++ currently has. you have a single "switch" command like
CTRL-TAB, which as long as you hold the CTRL button down, brings successive
buffers that are "next in line" into the current position, bumping the rest
forward. once you release the CTRL key, the chain is broken, and further
CTRL-TABs will start from the beginning again. this way, frequently used
buffers naturally move toward the front of the chain, and you can switch
back and forth between two buffers using CTRL-TAB. the only problem with
CTRL-TAB is that it's a bit awkward. the way to implement this is to have
modifier-up strokes fire off a hook, like modifier-up-hook. this is driven
by event dispatch, so there are no synchronization issues. when C-tab is
pressed, the binding function does something like set a one-shot handler on
the modifier-up-hook (perhaps separate hooks for separate modifiers?).
to do this, we'd also want to change the buffer tabs so that they maintain
their own order. in particular, they start out synched to the regular
order, but as you make changes, you don't want the tabs to change
order. (in fact, they may already do this.) selecting a particular buffer
from the buffer tabs DOES make the buffer go to the head of the line. the
invariant is that if the tabs are displaying X items, those X items are the
first X items in the standard buffer list, but may be in a different
order. (it looks like the tabs may already implement all of this.)
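a rough sketch of the chain behavior proposed above, in python rather than
lisp or C. the class and method names are made up for illustration;
modifier_up() stands in for whatever the proposed modifier-up-hook would
end up calling. nothing here is existing XEmacs code.

```python
class BufferSwitcher:
    """Most-recently-used buffer order, with a chain that lasts while CTRL is held."""

    def __init__(self, buffers):
        self.order = list(buffers)  # order[0] is the current buffer
        self.chain = None           # snapshot of the order, taken when a chain starts
        self.index = 0              # how far along the snapshot we are

    def switch(self):
        """What the C-tab binding would do: step to the next buffer in line."""
        if self.chain is None:
            # a new chain starts here; this is the point where the one-shot
            # handler on the (hypothetical) modifier-up-hook would be installed
            self.chain = list(self.order)
            self.index = 0
        self.index = (self.index + 1) % len(self.chain)
        chosen = self.chain[self.index]
        # bring the chosen buffer to the front, bumping the rest forward
        self.order = [chosen] + [b for b in self.chain if b != chosen]
        return chosen

    def modifier_up(self):
        """Stand-in for the modifier-up handler: break the chain."""
        self.chain = None
```

with three buffers a, b, c: C-tab, release, C-tab, release toggles between
a and b, while holding CTRL through two C-tabs lands on c and leaves the
order as c, a, b.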
oct 26, 2001:
- test all eol detection stuff under windows w/ and w/o mule, unix w/ and
w/o mule. (test configure flag, command-line flag, menu option) may need
a way of pretending to be unix under cygwin.
- test under windows w/ and w/o mule, cygwin w/ and w/o mule, cygwin x
windows w/ and w/o mule.
- test undecided-dos/unix/mac.
- check ESC ESC works as isearch-quit under TTY's.
- test coding-system-base and all its uses (grep for them).
- menu item to revert to most recent auto save.
- consider renaming build_string -> build_intstring and build_c_string to
build_string. (consistent with build_msg_string et al; many more
build_c_string than build_string)
oct 20, 2001:
fixed problem causing crash due to invalid internal-format data, fixed an
existing bug in valid_char_p, and added checks to more quickly catch when
invalid chars are generated. still need to investigate why
mswindows-multibyte is being detected.
i now see why -- we only process 65536 bytes due to a constant
MAX_BYTES_PROCESSED_FOR_DETECTION. instead, we should have no limit as
long as we have a seekable stream. we also need to write
stderr_out_lisp(), used in the debug info routines i wrote.
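a minimal sketch of the "no limit if seekable" idea, using python file
objects as a stand-in for lstreams. detection_sample() is a hypothetical
helper, not an existing function; only the constant name and its value come
from the note above.

```python
import io

MAX_BYTES_PROCESSED_FOR_DETECTION = 65536  # the limit mentioned in the note


def detection_sample(stream, limit=MAX_BYTES_PROCESSED_FOR_DETECTION):
    """Return the bytes a detector would be allowed to look at."""
    if stream.seekable():
        # seekable: look at everything, then rewind so decoding starts over
        start = stream.tell()
        data = stream.read()
        stream.seek(start)
        return data
    # not seekable: we can only afford a bounded look; the caller has to
    # keep these bytes around, since they cannot be read a second time
    return stream.read(limit)
```

under this scheme a file whose first odd character sits past the 64k mark
would still be seen by the detector, as long as the stream can be rewound.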
check once more about DEBUG_XEMACS. i think debugging info should be
ON by default. make sure it is. check that nothing untoward will result
in a production system, e.g. presumably assert()s should not really abort().
(!! Actually, this should be runtime settable! Use a variable for this, and
it can be set using the same XEMACSDEBUG method. In fact, now that I think
of it, I'm sure that debugging info should be on always, with runtime ways
of turning on or off any funny behavior.)
oct 19, 2001:
fixed various bugs preventing packages from being able to be built. still
another bug, with psgml/etc/cdtd/docbook, which contains some strange
characters starting around char pos 110,000. It gets detected as
mswindows-multibyte (wrong! why?) and then invalid internal-format data is
generated. need to fix mswindows-multibyte (and possibly add something
that signals an error as well; need to work on this error-signalling
mechanism) and figure out why it's getting detected as such. what i should
do is add a debug var that outputs blow-by-blow info of the detection
oct 9, 2001:
the stuff with global-window-system-map doesn't appear to work. in any
case it needs better documentation. [DONE]
M-home, M-end do work, but cause cl-macs to get loaded. why?
oct 8, 2001:
finished the coding system changes and they finally work!
need to implement undecided-unix/dos/mac. they should be easy to do; it
should be enough to specify an eol-type but not do-eol, but check this.
consider making the standard naming be foo-lf/crlf/cr, with unix/dos/mac as
print methods for coding systems should include some of the generic
properties. (also then fix print_..._within_print_method). [DONE]
in a little while, go back and delete the text-file-wrapper-coding-system
code. (it'll be in CVS if necessary to get at it.) [DONE]
need to verify at some point that non-text-file coding systems work
properly when specified. when gzip is working, this would be a good test
case. (and consider creating base64 as well!)
remove extra crap from coding-system-category that checks for chain coding
perhaps make a primitive that gets at coding-system-canonical. [DONE]
need to test cygwin, compiling the mule packages, get unix-eol stuff
working. frank from germany says he doesn't see a lisp backtrace when he
gets an error during temacs? verify that this actually gets outputted.
consider putting the current language on the modeline, mousable so it can
be switched. also consider making the coding system be mousable and the
line number (pick a line) and the percentage (pick a percentage).
oct 6, 2001:
added code so that debug_print() will output a newline to the mswindows
debugging output, not just the console. need to test. [DONE]
working on problem where all files are being detected as binary. the
problem may be that the undecided coding system is getting wrapped with an
auto-eol coding system, which it shouldn't be -- but even in this
situation, we should get the right results! check the
canonicalize-after-coding methods. also, determine_real_coding_system
appears to be getting called even when we're not detecting encoding. also,
undecided needs a print method to show its params, and chain needs to be
updated to show canonicalize_after_coding. check others as well. [DONE]
oct 5, 2001:
finished up coding system changes, testing.
errors byte-compiling files in iso-2022-7-bit. perhaps it's not correctly
detecting the encoding?
noticed a problem in the dfc macros: we call
get_coding_system_for_text_file with eol_wrap == 1, to allow for
auto-detection of the eol type; but this defeats the check and
short-circuit for unicode.
still need to implement calling determine_real_coding_system() for
non-seekable streams. to implement correctly, we need to do our own
buffering. [DONE, BUT WITHOUT BUFFERING]
oct 4, 2001:
implemented most stuff below.
need to finish up changes to make_coding_system_1. (i changed the way
internal coding systems were handled; i need to create subsidiaries for all
types of coding systems, not just text ones.) there's a nasty xfree() crash
i was hitting; perhaps it'll go away once all stuff has been rewritten.
check under cygwin to make sure that when an error occurs during loadup, a
backtrace is output.
as soon as andy releases his new setup, we should put it onto various
standard windows software repositories.
oct 3, 2001:
added global-tty-map and global-window-system-map. add some stuff to the
maps, e.g. C-x ESC for repeat vs. C-x ESC ESC on TTY's, and of course ESC
ESC on window systems vs. ESC ESC ESC on TTY's. [TEST]
was working on integrating the two help-for-tutorial versions (mule,
non-mule). [DONE, but test under non-Mule]
was working on the file-coding changes. need to think more about
text-file-wrapper. conclusion i think is that
get_coding_system_for_text_file should wrap using a special coding system
type called a text-file-wrapper, which inherits from chain, and implements
canonicalize-after-decoding to just return the unwrapped coding system. We
need to implement inheritance of coding systems, which will certainly come
in extremely useful when coding systems get implemented in Lisp, which
should happen at some point. (see existing docs about this.) essentially,
we have a way of declaring that we inherit from some system, and the
appropriate data structures get created, perhaps just an extra inheritance
pointer. but when we create the coding system, the extra data needs to be
a stretchy array of offsets, pointing to the type-specific data for the
coding system type and all its parents. that means that in the methods
structure for a coding system (which perhaps should be expanded beyond
method, it's just a "class structure") is the index in these arrays of
offsets. CODING_SYSTEM_DATA() can take any of the coding system classes
(rename type to class!) that make up this class. similarly, a coding
system class inherits its methods from the class above unless specifying
its own method, and can call the superclass method at any point by either
just invoking its name, or conceivably by some macro like
CALL_SUPER (method, (args))
similar mods would have to be made to coding stream structures.
perhaps for the immediate we can just sort of fake things like we currently
do with undecided calling some stuff from chain.
oct 2, 2001:
need to implement support for iso-8859-15, i.e. iso-8859-1 + euro symbol.
figure out how to fall back to iso-8859-1 as necessary.
leave the current bindings the way they are for the moment, but bump off
M-home and M-end (hardly used), and substitute my buffer movement stuff
there. [DONE, but test]
there's something to be said for combining block of 6 and paragraph,
esp. if we make the definition of "paragraph" be so that it skips by 6 when
within code. hmm.
eliminate advertised-undo crap, and similar hacks. [DONE]
think about obsolete stuff to be eliminated. think about eliminating or
dimming obsolete items from hyper-apropos and something similar in
sep 30, 2001:
synched up the tutorials with FSF 21.0.105. was rewriting them to favor
the cursor keys over the older C-p, etc. keys.
Got thinking about key bindings again.
(1) I think that M-up/down and M-C-up/down should be reversed. I use
scroll-up/down much more often than motion by paragraph.
(2) Should we eliminate move by block (of 6) and substitute it for
paragraph? This would have the advantage that I could make bindings
for buffer change (forward/back buffer), perhaps M-C-up/down. with
shift, M-C-S-up/down only goes within the same type (C files, etc.).
alternatively, just bump off beginning-of-defun from C-M-home, since
it's on C-M-a already.
need someone to go over the other tutorials (five new ones, from FSF
21.0.105) and fix them up to correspond to the english one.
shouldn't shift-motion work with C-a and such as well as arrows?
sep 29, 2001:
charcount_to_bytecount can also be made to scream -- as can scan_buffer,
buffer_mule_signal_inserted_region, others? we should start profiling
though before going too far down this line.
Debug code that causes no slowdown should in general remain in the
executable even in the release version because it may be useful (e.g. for
people to see the event output). so DEBUG_XEMACS should be rethought.
things like use of msvcrtd.dll should be controlled by error_checking on.
maybe DEBUG_XEMACS controls general debug code (e.g. use of msvcrtd.dll,
asserts abort, error checking), and the actual debugging code should remain
always, or be conditionalized on something else.
doc strings in dumped files are displayed with an extra blank line between
each line. presumably this is recent? i assume either the change to
detect-coding-region or the double-wrapping mentioned below.
error with coding-system-property on iso-2022-jp-dos. problem is that that
coding system is wrapped, so its type shows up as chain, not iso-2022.
this is a general problem, and i think the way to fix it is to in essence
do late canonicalization -- similar in spirit to what was done long ago,
canonicalize_when_code, except that the new coding system (the wrapper) is
created only once, either when the original cs is created or when first
needed. this way, operations on the coding system work as expected, and
you get the same results as currently when decoding/encoding. the only
thing tricky is handling canonicalize-after-coding and the ever-tricky
double-wrapping problem mentioned below. i think the proper solution is to
move the autodetection of eol into the main autodetect type. it can be
asked to autodetect eol, coding, or both. for just coding, it does what it
currently does. for just eol, it does something similar to what it currently does
but runs the detection code that convert-eol currently does, and selects
the appropriate convert-eol system. when it does both eol and coding, it
does something on the order of creating two more autodetect coding systems,
one for eol only and one for coding only, and chains them together. when
each has detected the appropriate value, the results are combined. this
automatically eliminates the double-wrapping problem, removes the need for
complicated canonicalize-after-coding stuff in chain, and fixes the problem
of autodetect not having a seekable stream because hidden inside of a
chain. (we presume that in the both-eol-and-coding case, the various
autodetect coding streams can communicate with each other appropriately.)
also, we should solve the problem of internal coding systems floating
around and clogging up the list simply by having an "internal" property on
cs's and an internal param to coding-system-list (optional; if not given,
you don't get the internal ones). [DONE]
we should try to reduce the size of the from-unicode tables (the dominant
memory hog in the tables). one obvious thing is to not store a whole
emchar as the mapped-to value, but a short that encodes the octets. [DONE]
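a small sketch of what "a short that encodes the octets" could look like.
the function names are hypothetical, and it assumes the charset is implied
by which table the entry lives in, so only the (at most two) position
octets need storing.

```python
def pack_octets(c1, c2=0):
    """Pack up to two charset position octets into one 16-bit value."""
    if not (0 <= c1 < 256 and 0 <= c2 < 256):
        raise ValueError("octets must fit in one byte each")
    return (c1 << 8) | c2


def unpack_octets(value):
    """Recover the two octets from the 16-bit table entry."""
    return (value >> 8) & 0xFF, value & 0xFF
```

each table entry then takes two bytes instead of a full emchar.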
sep 28, 2001:
need to merge up to latest in trunk.
add unicode charsets for all non-translatable unicode chars; probably want
to extend the concept of charsets to allow for dimension 3 and dimension 4
charsets. for the moment we should stick with just dimension 3 charsets;
otherwise we run past the current maximum of 4 bytes per emchar. (most code
would work automatically since it uses MAX_EMCHAR_LEN; the trickiness is in
certain code that has intimate knowledge of the representation.
e.g. bufpos_to_bytind() has to multiply or divide by 1, 2, 3, or 4,
and has special ways of handling each number. with 5 or 6 bytes per char,
we'd have to change that code in various ways.) 96x96x96 = 884,736,
so with two 96x96x96 charsets, we could tackle all Unicode values
representable by UTF-16 and then some -- and only these codepoints will
ever have assigned chars, as far as we know.
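the arithmetic behind the two-charset claim, spelled out. the only outside
fact used is that UTF-16 can address code points U+0000 through U+10FFFF;
charsets_needed() is just a helper for this note.

```python
CODES_PER_DIM3_CHARSET = 96 ** 3  # three octets, 96 values each
UTF16_CODE_POINTS = 0x110000      # U+0000 .. U+10FFFF inclusive


def charsets_needed(code_points, per_charset=CODES_PER_DIM3_CHARSET):
    """Smallest number of dimension-3 charsets covering code_points values."""
    return -(-code_points // per_charset)  # ceiling division


spare = 2 * CODES_PER_DIM3_CHARSET - UTF16_CODE_POINTS
```

here CODES_PER_DIM3_CHARSET is 884,736, UTF16_CODE_POINTS is 1,114,112, two
charsets are enough, and 655,360 positions are left over.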
need an easy way of showing the current language environment. some menus
need to have the current one checked or whatever. [DONE]
implement unicode surrogates.
implement buffer-file-coding-system-when-loaded -- make sure find-file,
revert-file, etc. set the coding system [DONE]
verify all the menu stuff [DONE]
implemented the entirely-ascii check in buffers. not sure how much gain
it'll get us as we already have a known range inside of which is constant
time, and with pure-ascii files the known range spans the whole buffer.
improved the comment about how bufpos-to-bytind and vice-versa work. [DONE]
fix double-wrapping of convert-eol: when undecided converts itself to
something with a non-autodetect eol, it needs to tell the adjacent
convert-eol to reduce itself to nothing.
need menu item for find file with specified encoding. [DONE]
renamed coding systems mswindows-### to windows-### to follow the standard
in rfc1345. [DONE]
implemented coding-system-subsidiary-parent [DONE]
HAVE_MULE -> MULE in files in nt/ so that depend checking works [DONE]
need to take the smarter search-all-files-in-dir stuff from my sample init
file and put it on the grep menu [DONE]
added item for revert w/specified encoding; mostly works, but needs fixes.
in particular, you get the correct results, but buffer-file-coding-system
does not reflect things right. also, there are too many entries. need to
split into submenus. there is already split code out there; see if it's
generalized and if not make it so. it should only split when there's more
than a specified number, and when splitting, split into groups of a
specified size, not into a specified number of groups. [DONE]
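the splitting rule described above, as a sketch. split_menu() and its
parameter names are made up here; the existing split code mentioned in the
note may well work differently.

```python
def split_menu(items, max_unsplit, group_size):
    """Split a long menu into submenu groups of a fixed size.

    Returns a list of groups. A single group means "no submenus needed".
    """
    items = list(items)
    if len(items) <= max_unsplit:
        return [items]
    # split by group size, not by number of groups: the last one may be short
    return [items[i:i + group_size] for i in range(0, len(items), group_size)]
```

so ten entries with a threshold of 8 and a group size of 4 come out as
groups of 4, 4 and 2, while five entries are left alone.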
too many entries in the langenv menus; need to split. [DONE]
sep 27, 2001:
NOTE: M-x grep for make-string causes crash now. something definitely to
do with string changes. check very carefully the diffs and put in those
sledgehammer checks. [DONE]
fix font-lock bug i introduced. [DONE]
added optimization to strings (keeps track of # of bytes of ascii at the
beginning of a string). perhaps should also keep an all-ascii flag to deal
with really large (> 2 MB) strings. rewrite code to count ascii-begin to
use the 4-or-8-at-a-time stuff in bytecount_to_charcount.
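a sketch of why the ascii-begin count helps. it uses UTF-8 as a stand-in
for the internal format (an assumption made only to keep the example
short; the real mule format has different lead/trail byte ranges), and the
function names are illustrative.

```python
def ascii_begin(data):
    """Number of bytes of pure ascii at the start of the string data."""
    n = 0
    for byte in data:
        if byte >= 0x80:
            break
        n += 1
    return n


def charcount_to_bytecount(data, ascii_prefix, nchars):
    """Byte length of the first nchars characters of data."""
    if nchars <= ascii_prefix:
        return nchars  # fast path: inside the ascii prefix, 1 char == 1 byte
    pos = ascii_prefix
    for _ in range(nchars - ascii_prefix):
        pos += 1  # lead byte
        while pos < len(data) and (data[pos] & 0xC0) == 0x80:
            pos += 1  # trail bytes (UTF-8 stand-in)
    return pos
```

for an all-ascii string every conversion takes the fast path, which is the
case an extra all-ascii flag would make explicit.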
Error: M-q is causing Invalid Regexp error on the above paragraph. It's
not in working. I assume it's a side effect of the string stuff. VERIFY!
Write sledgehammer checks for strings. [DONE]
revamped the locale/init stuff so that it tries much harder to get things
right. should test a bit more. in particular, test out Describe Language
on the various created environments and make sure everything looks right.
should change the menus: move the submenus on Edit->Mule directly under
Edit. add a menu entry on File to say "Reload with specified encoding ->".
Also Find File with specified encoding -> Also entry to change the EOL
settings for Unix, and implement it.
decode-coding-region isn't working because it needs to insert a binary
(char->byte) converter. [DONE]
chain should be rearranged to be in decoding order; similar for
source/sink-type, other things?
the detector should check for a magic cookie even without a seekable input.
(currently its input is not seekable, because it's hidden within a chain.
#### See what we can do about this.)
provide a way to display various settings, e.g. the current category
mappings and priority (see mule-diag; get this working so it's in the
path); also a way to print out the likeliness results from a detection,
perhaps a debug flag.
problem with `env', which causes path issues due to `env' in packages.
move env code to process, sync with fsf 21.0.105, check that the autoloads
in `env' don't cause problems. [DONE]
8-bit iso2022 detection appears broken; or at least, mule-canna.c is not so
sep 25, 2001:
something else to do is review the font selection and fix it so that (e.g.)
JISX-0212 can be displayed.
also, text in widgets needs to be drawn by us so that the correct fonts
will be displayed even in multi-lingual text.
sep 24, 2001:
the detection system is now properly abstracted. the detectors have been
rewritten to include multiple levels of abstraction. now we just need
detectors for ascii, binary, and latin-x, as well as more sophisticated
detectors in general and further review of the general algorithm for doing
detection. (#### Is this written up anywhere?) after that, consider adding
error-checking to decoding (VERY IMPORTANT) and verifying the binary
correctness of things under unix no-mule.
sep 23, 2001:
began to fix the detection system -- adding multiple levels of likelihood
and properly abstracting the detectors. the system is in place except for
the abstraction of the detector-specific data out of the struct
detection_state. we should get things working first before tackling that
(which should not be too hard). i'm rewriting algorithms here rather than
just converting code, so it's harder. mostly done with everything, but i
need to review all detectors except iso2022 and make them properly follow
the new way. also write a no-conversion detector. also need to look into
the `recode' package and see how (if?) they handle detection, and maybe
copy some of the algorithms. also look at recent FSF 21.0 and see if their
algorithms have improved.
sep 22, 2001:
fixed gc bugs from yesterday.
fixed truename bug.
close/finalize stuff works.
eliminated notyet stuff in syswindows.h.
eliminated special code in tstr_to_c_string.
fixed pdump problems. (many of them, mostly latent bugs, ugh)
fixed cygwin sscanf problems in parse-unicode-translation-table. (NOT a
sscanf bug, but subtly different behavior w.r.t. whitespace in the format
string, combined with a debugger that sucks ROCKS!! and consistently
outputs garbage for variable values.)
main stuff to test is the handling of EOF recognition vs. binary
(i.e. check what the default settings are under Unix). then we may have
something that WORKS on all platforms!!! (Also need to test Windows
sep 21, 2001:
finished redoing the close/finalize stuff in the lstream code. but i
encountered again the nasty bug mentioned on sep 15 that disappeared on its
own then. the problem seems to be that the finalize method of some of the
lstreams is calling Lstream_delete(), which calls free_managed_lcrecord(),
which is a no-no when we're inside of garbage-collection and the object
passed to free_managed_lcrecord() is unmarked, and about to be released by
the gc mechanism -- the free lists will end up with xfree()d objects on
them, which is very bad. we need to modify free_managed_lcrecord() to
check if we're in gc and the object is unmarked, and ignore it rather than
move it to the free list. [DONE]
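the guard described above, reduced to a toy model. Record and Heap are
invented for the example; only the rule itself (in gc + unmarked means
"leave it to the sweeper") comes from the note.

```python
class Record:
    def __init__(self, marked=False):
        self.marked = marked


class Heap:
    def __init__(self):
        self.in_gc = False
        self.free_list = []

    def free_managed(self, record):
        """Toy version of the check proposed for free_managed_lcrecord()."""
        if self.in_gc and not record.marked:
            # the sweeper is about to release this object anyway; putting it
            # on the free list would leave a freed object there
            return False
        self.free_list.append(record)
        return True
```

outside of gc, or for marked objects, the record goes onto the free list as
before.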
(#### What we really need to do is do what Java and C# do w.r.t. their
finalize methods: For objects with finalizers, when they're about to be
freed, leave them marked, run the finalizer, and set another bit on them
indicating that the finalizer has run. Next GC cycle, the objects will
again come up for freeing, and this time the sweeper notices that the
finalize method has already been called, and frees them for good (provided
that a finalize method didn't do something to make the object alive
sep 20, 2001:
redid the lstream code so there is only one coding stream. combined the
various doubled coding stream methods into one; i'm a little bit unsure of
this last part, though, as the results of combining the two together seem
unclean. got it to compile, but it crashes in loadup. need to go through
and rehash the close vs. finalize stuff, as the problem was stuff getting
freed too quickly, before the canonicalize-after-decoding was run. should
eliminate entirely CODING_STATE_END and use a different method (close
coding stream). rewrite to use these two. make sure they're called in the
right places. Lstream_close on a stream should *NOT* do finalizing.
finalize only on delete. [DONE]
in general i'd like to see the flags eliminated and converted to
bit-fields. also, rewriting the methods to take advantage of rejecting
should make it possible to eliminate much of the state in the various
methods, esp. including the flags. need to test this is working, though --
reduce the buffer size down very low and try files with only CRLF's in
them, with one offset by a byte from the other, and see if we correctly
still have the problem with incorrectly truenaming files.
sep 19, 2001:
bug reported: crash while closing lstreams.
the lstream/coding system close code needs revamping. we need to document
that order of closing lstreams is very important, and make sure we're
consistent. furthermore, chain and undecided lstreams need to close their
underneath lstreams when they receive the EOF signal (there may be data in
the underneath streams waiting to come out), not when they themselves are
(if only we had proper inheritance. i think in any case we should
simulate it for the chain coding stream -- write things in such a way that
undecided can use the chain coding stream and not have to duplicate
in general we need to carefully think through the closing process to make
sure everything always works correctly and in the right order. also check
very carefully to make sure there are no dangling pointers to deleted
objects floating around.
move the docs for the lstream functions to the functions themselves, not
the header files. document more carefully what exactly Lstream_delete()
means and how it's used, what the connections are between Lstream_close(),
Lstream_delete(), Lstream_flush(), lstream_finalize, etc. [DONE]
additional error-checking: consider deadbeefing the memory in objects
stored in lcrecord free lists; furthermore, consider whether lifo or fifo
is correct; under error-checking, we should perhaps be doing fifo, and
setting a minimum number of objects on the lists that's quite large so that
it's highly likely that any erroneous accesses to freed objects will go
into such deadbeefed memory and cause crashes. also, at the earliest
available opportunity, go through all freed memory and check for any
consistency failures (overwrites of the deadbeef), crashing if so. perhaps
we could have some sort of id for each block, to easier trace where the
offending block came from. (all of these ideas are present in the debug
system malloc from VC++, plus more stuff.) there's similar code i wrote
sitting somewhere (in free-hook.c? doesn't appear so. we need to delete the
blocking stuff out of there!). also look into using the debug system
malloc from VC++, which has lots of cool stuff in it. we even have the
sources. that means compiling under pdump, which would be a good idea
anyway. set it as the default. (but then, we need to remove the
requirement that Xpm be a DLL, which is extremely annoying. look into
test the windows code page coding systems recently created.
problems reading my mail files -- 1personal appears to hang, others come up
with lots of ^M's. investigate.
test the enum functions i just wrote, and finish them.
still pdump problems.
sep 18, 2001:
critical-quit broken sometime after aug 25.
-- fixed critical quit.
-- fixed process problems.
-- print routines work. (no routine for ccl, though)
-- can read and write unicode files, and they can still be read by some
-- defaults should come up correctly -- mswindows-multibyte is general.
still need to test matej's stuff.
seems ok with multibyte stuff but needs more testing.
sep 17, 2001:
!!!!! something broken with processes !!!!! cannot send mail anymore. must
sep 17, 2001:
on mon/wed nights, stop *BEFORE* 11pm. Otherwise i just start getting
woozy and can't concentrate.
just finished getting assorted fixups to the main branch committed, so it
will compile under C++ (Andy committed some code that broke C++ builds).
cup'd the code into the fixtypes workspace, updated the tags appropriately.
i've created the appropriate log message, sitting in fixtypes.txt in
/src/xemacs; perhaps it should go into a README. now i just have to build
on everything (it's currently building), verify it's ok, run patcher-mail,
my mule ws is also very close. need to:
-- test the new print routines.
-- test it can read and write unicode files, and they can still be read by
some other program.
-- try to see if unicode can be auto-detected properly.
-- test it can read and write multibyte files in a few different formats.
currently can't recognize them, but if you set the cs right, it should
-- examine the test files sent by matej and see if we can handle them.
sep 15, 2001:
more eol fixing. this stuff is utter crap.
currently we wrap coding systems with convert-eol-autodetect when we create
them in make_coding_system_1. i had a feeling that this would be a
problem, and indeed it is -- when autodetecting with `undecided', for
example, we end up with multiple layers of eol conversion. to avoid this,
we need to do the eol wrapping *ONLY* when we actually retrieve a coding
system in places such as insert-file-contents. these places are
insert-file-contents, load, process input, call-process-internal,
encode/decode/detect-coding-region, database input, ...
(later) it's fixed, and things basically work. NOTE: for some reason,
adding code to wrap coding systems with convert-eol-lf when eol-type == lf
results in crashing during garbage collection in some pretty obscure place
-- an lstream is free when it shouldn't be. this is a bad sign. i guess
something might be getting initialized too early?
we still need to fix the canonicalization-after-decoding code to avoid
problems with coding systems like `internal-7' showing up. basically, when
eol==lf is detected, nil should be returned, and the callers should handle
it appropriately, eliding when necessary. chain needs to recognize when
it's got only one (or even 0) items in the chain, and elide out the chain.
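the eliding rule in sketch form. coding systems are plain strings here and
a chain is a tagged tuple, both purely for illustration; None plays the
role of the nil returned when eol == lf is detected.

```python
def canonicalize_chain(items):
    """Collapse a chain after decoding, dropping members that reduced to nil."""
    kept = [cs for cs in items if cs is not None]
    if not kept:
        return None      # nothing left: the whole chain elides out
    if len(kept) == 1:
        return kept[0]   # one item: return it directly, no chain wrapper
    return ("chain", tuple(kept))
```

this way something like `internal-7' never needs to show up as the result:
a chain of one real coding system plus an lf converter reports just the
real coding system.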
sep 11, 2001: the day that will live in infamy.
rewrite of sep 9 entry about formats:
when calling make-coding-system, the name can be a cons of (format1 .
format2), specifying that it decodes format1->format2 and encodes the other
way. if only one name is given, that is assumed to be format1, and the
other is either `external' or `internal' depending on the end type.
normally the user when decoding gives the decoding order in formats, but
can leave off the last one, `internal', which is assumed. a multichain
might look like gzip|multibyte|unicode, using the coding systems named
`gzip', `(unicode . multibyte)' and `unicode'. the way this actually works
is by searching for gzip->multibyte; if not found, look for gzip->external
or gzip->internal. (In general we automatically do conversion between
internal and external as necessary: thus gzip|crlf does the expected, and
maps to gzip->external, external->internal, crlf->internal, which when
fully specified would be gzip|external:external|internal:crlf|internal --
see below.) To forcibly fit together two converters that have explicitly
specified and incompatible names (say you have unicode->multibyte and
iso8859-1->ebcdic and you know that the multibyte and iso8859-1 in this
case are compatible), you can force-cast using :, like this:
ebcdic|iso8859-1:multibyte|unicode. (again, if you force-cast between
internal and external formats, the conversion happens automatically.)
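a sketch of two pieces of the scheme above: splitting a spec on the
force-cast `:' and the `|' separators, and the lookup order for a
converter. the registry keyed by (from, to) pairs and both function names
are assumptions for the example, and the automatic internal/external
conversion is left out.

```python
def parse_chain(spec):
    """Split a spec into force-cast segments, each a list of format names."""
    return [segment.split("|") for segment in spec.split(":")]


def find_converter(registry, src, dst):
    """Look for src->dst, then fall back to src->external, src->internal."""
    for key in ((src, dst), (src, "external"), (src, "internal")):
        if key in registry:
            return registry[key]
    return None
```

for example, with only a gzip->external converter registered, asking for
gzip->multibyte falls back to that one.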
sep 10, 2001:
moved the autodetection stuff (both codesys and eol) into particular coding
systems -- `undecided' and `convert-eol' (type == `autodetect'). needs
lots of work. still need to search through the rest of the code and find
any remaining auto-detect code and move it into the undecided coding
system. need to modify make-coding-system so that it spits out
auto-detecting versions of all text-file coding systems unless we say not
to. need to eliminate entirely the EOF flag from both the stream info and the
coding system; have only the original-eof flag. in
coding_system_from_mask, need to check that the returned value is not of
type `undecided', falling back to no-conversion if so. also need to make
sure we wrap everything appropriate for text-files -- i removed the
wrapping on set-coding-category-list or whatever (need to check all those
files to make sure all wrapping is removed). need to review carefully the
new code in `undecided' to make sure it works and preserves the same logic
as previously. need to review the closing and rewinding behavior of chain
and undecided (same -- should really consolidate into helper routines, so
that any coding system can embed a chain in it) -- make sure the dynarr's
are getting their data flushed out as necessary, rewound/closed in the
right order, no missing steps, etc.
also split out mule stuff into mule-coding.c. work done on
configure/xemacs.mak/Makefiles not done yet. work on emacs.c/symsinit.h to
interface with the new init functions not done yet.
also put in a few declarations of the way i think the abstracted detection
stuff ought to go. DON'T WORK ON THIS MORE UNTIL THE REST IS DEALT WITH
AND WE HAVE A WORKING XEMACS AGAIN WITH ALL EOL ISSUES NAILED.
really need a version of cvs-mods that reports only the current directory.
WRITE THIS! use it to implement a better cvs-checkin.
sep 9, 2001:
implemented a gzip coding system. unfortunately, doesn't quite work right
because it doesn't handle the gzip headers -- it just reads and writes raw
zlib data. there's no function in the library to skip past the header, but
we do have some code out of the library that we can snarf that implements
header parsing. we need to snarf that, store it, and output it again at
the beginning when encoding. in the process, we should create a "get next
byte" macro that bails out when there are no more. using this, we set up a
nice way of doing most stuff statelessly -- if we have to bail, we reject
everything back to the sync point. also need to fix up the autodetection
of zlib in configure.in.
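a rough sketch of the "get next byte" idea, in Python rather than C for brevity. the field layout follows the gzip file format (RFC 1952); the names `parse_gzip_header', `next_byte' and `NeedMoreData' are made up for illustration and are not in the source tree. when the input runs out mid-header, the parser consumes nothing, i.e. everything is rejected back to the sync point:

```python
class NeedMoreData(Exception):
    """Raised by the byte reader when the input runs out mid-header."""


def parse_gzip_header(data: bytes) -> int:
    """Return the header length in bytes, or 0 if more data is needed.

    Returning 0 means "reject everything back to the sync point": the
    caller keeps all of `data` and retries once more bytes have arrived.
    """
    pos = 0

    def next_byte() -> int:
        # The "get next byte" helper: bail out when there are no more bytes.
        nonlocal pos
        if pos >= len(data):
            raise NeedMoreData
        byte = data[pos]
        pos += 1
        return byte

    def skip_zero_terminated() -> None:
        while next_byte() != 0:
            pass

    try:
        if next_byte() != 0x1F or next_byte() != 0x8B:
            raise ValueError("not a gzip stream")
        if next_byte() != 8:
            raise ValueError("unknown compression method")
        flags = next_byte()
        for _ in range(6):              # MTIME (4 bytes), XFL, OS
            next_byte()
        if flags & 0x04:                # FEXTRA: 2-byte length, then data
            xlen = next_byte() | (next_byte() << 8)
            for _ in range(xlen):
                next_byte()
        if flags & 0x08:                # FNAME: zero-terminated string
            skip_zero_terminated()
        if flags & 0x10:                # FCOMMENT: zero-terminated string
            skip_zero_terminated()
        if flags & 0x02:                # FHCRC: 2-byte header CRC
            next_byte()
            next_byte()
    except NeedMoreData:
        return 0
    return pos
```

because the parser keeps no state between calls, the caller only has to hold on to the unconsumed bytes and call it again later. the header bytes `data[:n]' can be stored as-is and written out again at the beginning when encoding.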
BIG problems with eol. finished up everything i thought i would need to
get eol stuff working, but no -- when you have mswindows-unicode, with its
eol set to autodetect, the detection routines themselves do the autodetect
(first), and fail (they report CR on CRLF because of the NULL byte between
the CR and the LF) since they're not looking at ascii data. with a chain
it's similarly bad. for mswindows-multibyte, for example, which is a chain
unicode->unicode-to-multibyte, autodetection happens inside of the chain,
both when unicode and unicode-to-multibyte are active. we could twiddle
around with the eol flags to try to deal with this, but it's gonna be a big
mess, which is exactly what we're trying to avoid. what we basically want
is to entirely rip out all EOL settings from either the coding system or
the stream (yes, there are two! one might say autodetect, and then the
stream contains the actual detected value). instead, we simply create an
eol-autodetect coding system -- or rather, it's part of the convert-eol
coding system. convert-eol, type = autodetect, does autodetection the
first time it gets data sent to it to decode, and thereafter sets a stream
parameter indicating the actual eol type for this stream. this means that
all autodetect coding systems, as created by `make-coding-system', really
are chains with a convert-eol at the beginning. only subsidiary xxx-unix
has no wrapping at all. this should allow eof detection of gzip, unicode,
etc. for that matter, general autodetection should be entirely
encapsulated inside of the `autodetect' coding system, with no
eol-autodetection -- the chain becomes convert-eol (autodetect) ->
autodetect or perhaps backwards. the generic autodetect similarly has a
coding-system in its stream methods, and needs somehow or other to insert
the detected coding-system into the chain. either it contains a chain
inside of it (perhaps it *IS* a chain), or there's some magic involving
canonicalization-type switcherooing in the middle of a decode. either way,
once everything is good and done and we want to save the coding system so
it can be used later, we need to do another sort of canonicalization --
converting auto-detect-type coding systems into the detected systems.
again, a coding-system method, with some magic currently so that
subsidiaries get properly used rather than something that's new but
equivalent to subsidiaries. (#### perhaps we could use a hash table to
avoid recreating coding systems when not necessary. but that would require
that coding systems be immutable from external, and i'm not sure that's the
case.)
i really think, after all, that i should reverse the naming of everything
in chain and source-sink-type -- they should be decoding-centric. later
on, if/when we come up with the proper way to make it totally symmetrical,
we'll be fine whether before then we were encoding or decoding centric.
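the eol failure described above (CR reported on CRLF for mswindows-unicode data) is easy to reproduce outside of XEmacs. the sketch below is Python, not the actual detection code; both function names are hypothetical. it runs the same naive test once on the raw bytes and once on the decoded characters, which is where i understand a convert-eol (autodetect) stage would sit when decoding:

```python
def detect_eol_bytes(raw: bytes) -> str:
    """Naive byte-level detection: look at what follows the first CR or LF."""
    for i, byte in enumerate(raw):
        if byte == 0x0A:
            return "lf"
        if byte == 0x0D:
            return "crlf" if raw[i + 1:i + 2] == b"\n" else "cr"
    return "unknown"


def detect_eol_text(text: str) -> str:
    """The same test applied to decoded characters."""
    for i, ch in enumerate(text):
        if ch == "\n":
            return "lf"
        if ch == "\r":
            return "crlf" if text[i + 1:i + 2] == "\n" else "cr"
    return "unknown"


raw = "one\r\ntwo\r\n".encode("utf-16-le")
print(detect_eol_bytes(raw))                      # cr   (a 0x00 byte follows the 0x0D)
print(detect_eol_text(raw.decode("utf-16-le")))   # crlf
```

the byte-level version sees 0x0D 0x00 0x0A 0x00 and gives up after the 0x0D, which is exactly the misdetection noted above. once the data has been decoded, the same logic gives the right answer.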
sep 9, 2001:
investigated eol parameter.
implemented handling in make-coding-system of eol-cr and eol-crlf.
fixed calls everywhere to Fget_coding_system / Ffind_coding_system to
reject non-char->byte coding systems.
still need to handle "query eol type using coding-system-property" so it
magically returns the right type by parsing the chain.
no work done on formats, as mentioned below. we should consider using :
instead of || to indicate casting.
early sep 9, 2001:
renamed some codesys properties: `list' in chain -> chain; `subtype' in
unicode -> type. everything compiles again and sort of works; some CRLF
problems that may resolve themselves when i finish the convert-eol stuff.
the stuff to create subsidiaries has been rewritten to use chains; but i
still need to investigate how the EOL type parameter is used. also, still
need to implement this: when a coding system is created, and its eol type
is not autodetect or lf, a chain needs to be created and returned. i think
that what needs to happen is that the eol type can only be set to
autodetect or lf; later on this should be changed to simply be either
autodetect or not (but that would require ripping out the eol converting
stuff in the various coding systems), and eventually we will do the work on
the detection mechanism so it can do chain detection; then we won't need an
eol autodetect setting at all. i think there's a way to query the eol type
of a coding system; this should check to see if the coding system is a
chain and there's a convert-eol at the front; if so, the eol type comes
from the type of the convert-eol.
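a small model of the two rules above, in Python for brevity: creation returns a chain when the eol type is neither autodetect nor lf, and the eol query looks through a chain with a convert-eol at the front. the class and function names are illustrative assumptions, not the real C structures:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CodingSystem:
    name: str
    kind: str                       # e.g. "chain", "convert-eol", "unicode"
    eol_type: str = "lf"
    chain: List["CodingSystem"] = field(default_factory=list)


def make_coding_system(name: str, kind: str, eol_type: str = "lf") -> CodingSystem:
    """Only autodetect and lf stay on the coding system itself."""
    if eol_type in ("autodetect", "lf"):
        return CodingSystem(name, kind, eol_type)
    # Any other eol type becomes a chain with a convert-eol at the front.
    eol = CodingSystem("convert-eol-" + eol_type, "convert-eol", eol_type)
    return CodingSystem(name, "chain", chain=[eol, CodingSystem(name, kind)])


def query_eol_type(cs: CodingSystem) -> str:
    """Report the eol type, taking it from a leading convert-eol if present."""
    if cs.kind == "chain" and cs.chain and cs.chain[0].kind == "convert-eol":
        return cs.chain[0].eol_type
    return cs.eol_type
```

for example, `query_eol_type(make_coding_system("utf-16-dos", "unicode", "crlf"))' returns "crlf" even though the object handed back is a chain.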
also check out everywhere that Fget_coding_system or Ffind_coding_system is
called, and see whether anything but a char->byte system can be tolerated.
create a new function for all the places that only want char->byte,
something like get_coding_system_char_to_byte_only.
think about specifying formats in make-coding-system. perhaps the name can
be a cons of (format1, format2), specifying that it encodes
format1->format2 and decodes the other way. if only one name is given,
that is assumed to be format2, and the other is either `byte' or `char'
depending on the end type. normally the user when decoding gives the
decoding order in formats, but can leave off the last one, `char', which is
assumed. perhaps we should say `internal' instead of `char' and `external'
instead of byte. a multichain might look like gzip|multibyte|unicode,
using the coding systems named `gzip', `(unicode . multibyte)' and
`unicode'. we would have to allow something where one format is given only
as generic byte/char or internal/external to fit with any of the same
byte/char type. when forcibly fitting together two converters that have
explicitly specified and incompatible names (say you have
unicode->multibyte and iso8859-1->ebcdic and you know that the multibyte
and iso8859-1 in this case are compatible), you can force-cast using ||,
like this: ebcdic|iso8859-1||multibyte|unicode. this will also force
external->internal translation as necessary:
unicode|multibyte||crlf|internal does unicode->multibyte,
external->internal, crlf->internal. perhaps you'd need to put in the
internal translation, like this: unicode|multibyte|internal||crlf|internal,
which means unicode->multibyte, external->internal (multibyte is compatible
with external); force-cast to crlf format and convert crlf->internal.
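to make the proposed syntax concrete, here is a minimal parser sketch in Python. it only splits a spec into conversion steps and force-casts; it does not look up coding systems or insert the implicit external->internal translation. the function name and the tuple layout are assumptions made for this example:

```python
def parse_format_chain(spec: str):
    """Split a spec like 'a|b||c|d' into ('convert', ...) and ('cast', ...) steps."""
    steps = []
    previous = None
    for segment in spec.split("||"):
        formats = segment.split("|")
        if previous is not None:
            # '||' force-casts the end of one segment to the start of the next.
            steps.append(("cast", previous, formats[0]))
        for src, dst in zip(formats, formats[1:]):
            # '|' between two formats is an ordinary conversion.
            steps.append(("convert", src, dst))
        previous = formats[-1]
    return steps


for step in parse_format_chain("ebcdic|iso8859-1||multibyte|unicode"):
    print(step)
```

splitting on `||' first and `|' second keeps the two operators from being confused. the example spec comes out as convert ebcdic->iso8859-1, cast iso8859-1->multibyte, convert multibyte->unicode.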
even later: Sep 8, 2001:
chain doesn't need to set character mode, that happens automatically when
the coding systems are created. fixed chain to return correct source/sink
type for itself and to check the compatibility of source/sink types in its
chain. fixed decode/encode-coding-region to check the source and sink
types of the coding system performing the conversion and insert appropriate
byte->char/char->byte converters (aka "binary" coding system). fixed
set-coding-category-system to only accept the traditional
encode-char-to-byte types of coding systems.
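the source/sink checks above amount to something like the following Python sketch (types are encoding-centric, as in the rest of these notes). `Codec', `chain_end_types' and `wrap_for_region' are hypothetical names; the real code works on Lisp_Objects and lstreams, and the chain is assumed to be non-empty:

```python
from dataclasses import dataclass
from typing import List, Tuple

CHAR, BYTE = "char", "byte"


@dataclass(frozen=True)
class Codec:
    name: str
    source: str   # what the encode direction consumes
    sink: str     # what the encode direction produces


def chain_end_types(chain: List[Codec]) -> Tuple[str, str]:
    """Check that neighbours are compatible; return the chain's own types."""
    for left, right in zip(chain, chain[1:]):
        if left.sink != right.source:
            raise ValueError(f"{left.name} produces {left.sink}, "
                             f"but {right.name} expects {right.source}")
    return chain[0].source, chain[-1].sink


def wrap_for_region(chain: List[Codec]) -> List[Codec]:
    """Add "binary" converters so that both ends are character."""
    source, sink = chain_end_types(chain)
    wrapped = list(chain)
    if source == BYTE:
        wrapped.insert(0, Codec("binary", CHAR, BYTE))
    if sink == BYTE:
        wrapped.append(Codec("binary (reversed)", BYTE, CHAR))
    return wrapped
```

with a byte->byte system such as gzip, `wrap_for_region' puts a binary converter on each end, so the region operation always starts and finishes with characters.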
still need to extend chain to specify the parameters mentioned below,
esp. "reverse". also need to extend the print mechanism for chain so it
prints out the chain. probably this should be general: have a new method
to return all properties, and output those properties. you could also
implement a read syntax for coding systems this way.
still need to implement convert-eol and finish up the rest of the eol stuff
later September 7, 2001: (more like Sep 8)
moved many Lisp_Coding_System * params to Lisp_Object. In general this is
the way to go, and if we ever implement a copying GC, we will never want to
be passing direct pointers around. With no error-checking, we lose no
cycles using Lisp_Objects in place of pointers -- the Lisp_Object itself is
nothing but a pointer, and so all the casts and "dereferences" boil down to
nothing.
Clarified and cleaned up the "character mode" on streams, and documented
who (caller or object itself) has the right to be setting character mode on
a stream, depending on whether it's a read or write stream. changed
conversion_end_type method and enum source_sink_type to return
encoding-centric values, rather than decoding-centric. for the moment,
we're going to be entirely encoding-centric in everything; we can rethink
later. fixed coding systems so that the decode and encode methods are
guaranteed to receive only full characters, if that's the source type of
the data, as per conversion_end_type.
still need to fix the chain method so that it correctly sets the character
mode on all the lstreams in it and checks the source/sink types to be
compatible. also fix decode-coding-string and friends to put the
appropriate byte->character (i.e. no-conversion) coding systems on the ends
as necessary so that the final ends are both character. also add to chain
a parameter giving the ability to switch the direction of conversion of any
particular item in the chain (i.e. swap encoding and decoding). i think
what we really want to do is allow for arbitrary parameters to be put onto
a particular coding system in the chain, of which the only one so far is
swap-encode-decode. don't need too much codage here for that, but make the
September 7, 2001:
just added a return value from the decode and encode methods of a coding
system, so that some of the data can get rejected. fixed the calling
routines to handle this. need to investigate when and whether the coding
lstream is set to character mode, so that the decode/encode methods only
get whole characters. if not, we should do so, according to the source
type of these methods. also need to implement the convert_eol coding
system, and fix the subsidiary coding systems (and in general, any coding
system where the eol type is specified and is not LF) to be chains
after everything is working, need to remove eol handling from encode/decode
methods and eventually consider rewriting (simplifying) them given the
September 5, 2001:
-- need to organize this. get everything below into the TODO list.
CVS the TODO list frequently so i can delete old stuff. prioritize
-- move README.ben-mule... to STATUS.ben-mule...; use README for
intro, overview of what's new, what's broken, how to use the
-- need a global and local coding-category-precedence list, which get
-- finished the BOM support. also finished something not listed
below, expansion to the auto-generator of Unicode-encapsulation to
support bracketing code with #if ... #endif, for Cygwin and MINGW
problems, e.g. This is tested; appears to work.
-- need to add more multibyte coding systems now that we have various
properties to specify them. need to add DEFUN's for mac-code-page
and ebcdic-code-page for completeness. need to rethink the whole
way that the priority list works. it will continue to be total
junk until multiple levels of likeliness get implemented.
-- need to finish up the stuff about the various defaults. [need to
investigate more generally where all the different default values
are that control encoding. (there are six places or so.) need to
list them in make-coding-system docs and put pointers
elsewhere. [[[[#### what interface to specify that this default
should be unicode? a "Unicode" language environment seems too
drastic, as the language environment controls much more.]]]] even
skipping the Unicode stuff here, we need to survey and list the
variables that control coding page behavior and determine how they
need to be set for various possible scenarios:
-- total binary: no detection at all.
-- raw-text only: wants only autodetection of line endings, nothing else.
-- "standard Windows environment": tries for Unicode, falls back on
code page encoding.
-- some sort of East European environment, and Russian.
-- some sort of standard Japanese Windows environment.
-- standard Chinese Windows environments (traditional and simplified)
-- various Unix environments (European, Japanese, Russian, etc.)
-- Unicode support in all of these when it's reasonable
These really require multiple likelihood levels to be fully
implementable. We should see what can be done ("gracefully fall
back") with single likelihood level. need lots of testing.
-- need to fix the truename problem.
-- lots of testing: need to test all of the stuff above and below that's recently been implemented.
September 4, 2001:
mostly everything compiles. currently there is a crash in
parse-unicode-translation-table, and Cygwin/Mule won't run. it may
well be a bug in the sscanf() in Cygwin.
working on today:
-- adding BOM support for Unicode coding systems. mostly there, but
need to finish adding BOM support to the detection routines. then test.
-- adding properties to unicode-to-multibyte to specify the coding
system in various flexible ways, e.g. directly specified code page
or ansi or oem code page of specified locale, current locale,
user-default or system-default locale. need to test.
-- creating a `multibyte' coding system, with the same parameters as
unicode-to-multibyte and which resolves at coding-system-creation
time to the appropriate chain. creating the underlying mechanism
to allow such behind-the-scenes switcheroo. need to test.
-- set default-value of buffer-file-coding-system to
mswindows-multibyte, as Matej said it should be. need to test.
need to investigate more generally where all the different default
values are that control encoding. (there are six places or so.)
need to list them in make-coding-system docs and put pointers
elsewhere. #### what interface to specify that this default should
be unicode? a "Unicode" language environment seems too drastic, as
the language environment controls much more.
-- thinking about adding multiple levels of certainty to the detection
schemes, instead of just a mask. eventually, we need to totally
abstract things, but that can more easily be done in many steps. (we
need multiple levels of likelihood to more reasonably support a
Windows environment with code-page type files. currently, in order
to get them detected, we have to put them first, because they can
look like lots of other things; but then, other encodings don't get
detected. with multiple levels of likelihood, we still put the
code-page categories first, but they will return low levels of
likelihood. Lower-down encodings may be able to return higher
levels of likelihood, and will get taken preferentially.)
-- making it so you cannot disable file-coding, but you get an
equivalent default on Unix non-Mule systems where all defaults are
`binary'. need to test!!!!!!!!!
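a sketch of how multiple likelihood levels could interact with the priority list and with BOM detection, written in Python for brevity. the four levels, the detector functions and the category names are assumptions for illustration only; the existing detection code uses a mask, not this interface:

```python
import codecs

# Hypothetical likelihood levels; higher means more certain.
NO, LOW, MEDIUM, HIGH = 0, 1, 2, 3


def detect_code_page(data: bytes) -> int:
    # A code-page file can look like almost anything, so never claim much.
    return LOW


def detect_utf16(data: bytes) -> int:
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return HIGH
    return NO


def detect_utf8(data: bytes) -> int:
    if data.startswith(codecs.BOM_UTF8):
        return HIGH
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return NO
    # Valid multi-byte sequences are decent evidence; pure ASCII is not.
    return MEDIUM if any(b >= 0x80 for b in data) else LOW


# Code-page categories still come first, as described above.
PRIORITY = [("code-page", detect_code_page),
            ("utf-16", detect_utf16),
            ("utf-8", detect_utf8)]


def detect(data: bytes, priority=PRIORITY):
    best_name, best_level = None, NO
    for name, detector in priority:
        level = detector(data)
        if level > best_level:      # strict '>': earlier entries win ties
            best_name, best_level = name, level
    return best_name
```

the code-page category stays at the head of the list but only ever reports LOW, so a file with a BOM or with valid UTF-8 multi-byte sequences is taken by a lower-down category, while plain ASCII still falls to the code page.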
Matej (mostly, + some others) notes the following problems, and here
are possible solutions:
-- he wants the defaults to work right. [figure out what those
defaults are. i presume they are auto-detection of data in current
code page and in unicode, and new files have current code page set
as their output encoding.]
-- too easy to lose data with incorrect encodings. [need to set up an
error system for encoding/decoding. extremely important but a
little tricky to implement so let's deal with other issues now.]
-- EOL isn't always detected correctly. [#### ?? need examples]
-- truename isn't working: c:\t.txt and c:\tmp.txt have the same truename.
[should be easy to fix]
-- unicode files lose the BOM mark. [working on this]
-- command-line utilities use OEM. [actually it seems more
complicated. it seems they use the codepage of the console. we
may be able to set that, e.g. to UTF8, before we invoke a command.
need to investigate.]
-- no way to handle unicode characters not recognized as charsets. [we
need to create something like 8 private 2-dimensional charsets to
handle all BMP Unicode chars. Obviously this is a stopgap
solution. Switching to Unicode internal will ultimately make life
far easier and remove the BMP limitation. but for now it will
work. we translate all characters where we have charsets into
chars in those charsets, and the remainder in a unicode charset.
that way we can save them out again and guarantee no data loss with
unicode. this creates font problems, though ...]
-- problems with xemacs font handling. [xemacs font handling is not
sophisticated enough. it goes on a charset granularity basis and
only looks for a font whose name contains the corresponding windows
charset in it. with unicode this fails in various ways. for one
the granularity needs to be single character, so that those unicode
charsets mentioned above work; and it needs to query the font to
see what unicode ranges it supports, rather than just looking at
the charset ending.]
August 28, 2001:
working on getting everything to compile again: Cygwin, non-MULE,
pdump. not there yet.
mswindows-multibyte is now defined using chain, and works. removed
most vestiges of the mswindows-multibyte coding system type.
file-coding is on by default; should default to binary only on Unix.
Need to test. (Needs to compile first :-)
August 26, 2001:
I've fixed the issue of inputting non-ASCII text under -nuni, and done
some of the work on the Russian C-x problem -- we now compute the
other possibilities. We still need to fix the key-lookup code,
though, and that code is unfortunately a bit ugly. the best way, it
seems, is to expand the command-builder structure so you can specify
different interpretations for keys. (if we do find an alternative
binding, though, we need to mess with both the command builder and
this-command-keys, as does the function-key stuff. probably need to
abstract that munging code.)
-- support for WM_IME_CHAR. IME input can work under -nuni if we use
WM_IME_CHAR. probably we should always be using this, instead of
snarfing input using WM_COMPOSITION. i'll check this out.
-- Russian C-x problem. see above.
-- make sure it compiles and runs under non-mule. remember that some
code needs the unicode support, or at least a simple version of it.
-- make sure it compiles and runs under pdump. see below.
-- clean up mswindows-multibyte, TSTR_TO_C_STRING. see below. [DONE]
-- eliminate last vestiges of codepage<->charset conversion and similar stuff.
-- cut and paste. see below.
-- misc issues with handling lang environments. see also August 25,
"finally: working on the C-x in ...".
-- when switching lang env, needs to set keyboard layout.
-- user var to control whether, when moving into text of a
particular language, we set the appropriate keyboard layout. we
would need to have a lisp api for retrieving and setting the
keyboard layout, set text properties to indicate the layout of
text, and have a way of dealing with text with no property on
it. (e.g. saved text has no text properties on it.) basically,
we need to get a keyboard layout from a charset; getting a
language would do. Perhaps we need a table that maps charsets
to language environments.
-- test that the lang env is properly set at startup. test that
switching the lang env properly sets the C locale (call
setlocale(), set LANG, etc.) -- a spawned subprogram should have
the new locale in its environment.
-- look through everything below and see if anything is missed in this
priority list, and if so add it. create a separate file for the
priority list, so it can be updated as appropriate.
-- clean up the chain coding system. its list should specify decode
order, not encode; i now think this way is more logical. it should
check the endpoints to make sure they make sense. it should also
allow for the specification of "reverse-direction coding systems":
use the specified coding system, but invert the sense of decode and
encode.
-- along with that, places that take an arbitrary coding system and
expect the ends to be anything specific need to check this, and add
the appropriate conversions from byte->char or char->byte.
-- get some support for arabic, thai, vietnamese, japanese jisx 0212:
at least get the unicode information in place and make sure we have
things tied together so that we can display them. worry about r2l
some other time.
August 25, 2001:
There is actually more non-Unicode-ized stuff, but it's basically
inconsequential. (See previous note.) You can check using the file
nmkun.txt (#### RENAME), which is just a list of all the routines that
have been split. (It was generated from the output of `nmake
unicode-encapsulate', after removing everything from the output but
the function names.) Use something like
fgrep -f ../nmkun.txt -w [a-hj-z]*.[ch] |m
in the source directory, which does a word match and skips
intl-unicode-win32.[ch] and intl-win32.[ch], which have a whole lot of
references to these, unavoidably. It effectively detects what needs
to be changed because changed versions either begin qxe... or end with
A or W, and in each case there's no whole-word match.
The nasty bug has been fixed below. The -nuni option now works -- all
specially-written code to handle the encapsulation has been tested by
some operation (fonts by loadup and checking the output of (list-fonts
""); devmode by printing; dragdrop tests other stuff).
NOTE: for -nuni (Win 95), these areas need work:
-- cut and paste. we should be able to receive Unicode text if it's
there, and we should be able to receive it even in Win 95 or -nuni.
we should just check in all circumstances. also, under 95, when we
put some text in the clipboard, it may or may not also be
automatically enumerated as unicode. we need to test this out
and/or just go ahead and manually do the unicode enumeration.
-- receiving keyboard input. we get only a single byte, but we should
be able to correlate the language of the keyboard layout to a
particular code page, so we can then decode it correctly.
-- mswindows-multibyte. still implemented as its own thing. should
be done as a chain of (encoding) unicode | unicode-to-multibyte.
need to turn this on, get it working, and look into optimizations
in the dfc stuff. (#### perhaps there's a general way to do these
optimizations??? something like having a method on a coding system
that can specify whether a pure-ASCII string gets rendered as
pure-ASCII bytes and vice-versa.)
-- we have special macros TSTR_TO_C_STRING and such because formerly
the DFC macros didn't know about external stuff that was Unicode
encoded and would call strlen() on them. this is fixed, so now we
should undo the special macros, make em normal, remove the
comments about this, and make sure it works. [DONE]
-- finally: working on the C-x in Russian key layout problem. in the
process will probably end up doing work on cleaning up the handling
of keyboard layouts, integrating or deleting the FSF stuff, adding
code to change the keyboard layout as we move in and out of text in
different languages (implemented as a post-command-hook; we need
something like internal-post-command-hook if not already there, for
internal stuff that doesn't want to get mixed up with the regular
post-command-hook; similar for pre-command-hook). also, when
langenv changes, ways to set the keyboard layout appropriately.
-- i think the stuff above is higher priority than the other stuff
mentioned below. what i'm aiming for is to be able to input and
work with multiple languages without weird glitches, both under 95
and NT. the problems above are all basic impediments to such work.
we assume for the moment that the user can make use of the existing
file i/o conversion stuff, and put that lower in priority, after
the basic input is working.
-- i should get my modem connected and write up what's going on and
send it to the lists; also cvs commit my workspaces and get more
August 24, 2001:
All code has been Unicode-ized except for some stuff in console-msw.c
that deals with console output. Much of the Unicode-encapsulation
stuff, particularly the hand-written stuff, really needs testing. I
added a new command-line option, -nuni, to force use of all ANSI calls
-- XE_UNICODEP evaluates to false in this case.
There is a nasty bug that appeared recently, probably when the event
code got Unicode-ized -- bad interactions with OS sticky modifiers.
Hold the shift key down and release it, then instead of affecting the
next char only, it gets permanently stuck on (until you do a regular
shift+char stroke). This needs to be debugged.
Other things on agenda:
-- go through and prioritize what's listed below.
-- make sure the pdump code can compile and work. for the moment we
just don't try to dump any Unicode tables and load them up each
time. this is certainly fast but ...
-- there's the problem that XEmacs can't be run in a directory with
non-ASCII/Latin-1 chars in it, since it will be doing Unicode
processing before we've had a chance to load the tables. In fact,
even finding the tables in such a situation is problematic using
the normal commands. my idea is to eventually load the stuff
extremely extremely early, at the same time as the pdump data gets
loaded. in fact, the unicode table data (stored in an efficient
binary format) can even be stuck into the pdump file (which would
mean as a resource to the executable, for windows). we'd need to
extend pdump a bit: to allow for attaching extra data to the pdump
file. (something like pdump_attach_extra_data (addr, length)
returns a number of some sort, an index into the file, which you
can then retrieve with pdump_load_extra_data(), which returns an
addr (mmap()ed or loaded), and later you pdump_unload_extra_data()
when finished. we'd probably also need
pdump_attach_extra_data_append(), which appends data to the data
just written out with pdump_attach_extra_data(). this way,
multiple tables in memory can be written out into one contiguous
table. (we'd use the tar-like trick of allowing new blocks to be
written without going back to change the old blocks -- we just rely
on the end of file/end of memory.) this same mechanism could be
extracted out of pdump and used to handle the non-pdump situation
(or alternatively, we could just dump either the memory image of
the tables themselves or the compressed binary version). in the
case of extra unicode tables not known about at compile time that
get loaded before dumping, we either just dump them into the image
(pdump and all) or extract them into the compressed binary format,
free the original tables, and treat them like all other tables.
-- `C-x b' when using a Russian keyboard layout. XEmacs currently
tries to interpret C+cyrillic char, which causes an error. We want
C-x b to still work even when the keyboard normally generates
Cyrillic. What we should do is expand the keyboard event structure
so that it contains not only the actual char, but what the char
would have been in various other keyboard layouts, and in contexts
where only certain keystrokes make sense (creating control chars,
and looking up in keymaps), we proceed in order, processing each of
them until we get something. order should be something like:
current keyboard layout; layout of the current language
environment; layout of the user's default language; layout of the
system default language; layout of US English.
-- reading and writing Unicode files. multiple problems:
-- EOL's aren't handled right. for the moment, just fix the
Unicode coding systems; later on, create EOL-only coding
systems:
1. they would be character->character and operate next to the
internal data; this means that coding systems need to be able
to handle ends of lines that are either CR, LF, or CRLF.
usually this isn't a problem, as they are just characters
like any other and get encoded appropriately. however,
coding systems that are line-oriented need to recognize any
of the three as line endings.
2. we'd also have to complete the stuff that handles coding
systems where either end can be byte or char (four
possibilities total; use a single enum such as
ENCODES_CHAR_TO_BYTE, ENCODES_BYTE_TO_BYTE, etc.).
3. we'd need ways of specifying the chaining of coding systems.
e.g. when reading a coding system, a user can specify more
than one with a | symbol between them. when a context calls
for a coding system and a chain is needed, the `chain' coding
system is useful; but we should really expand the contexts
where a list of coding systems can be given, and whenever
possible try to inline the chain instead of using a
surrounding `chain' coding system.
4. the `chain' needs some work so that it passes all sorts of
lstream commands down to the chain inside it -- it should be
entirely transparent and the fact that there's actually a
surrounding coding system should be invisible. more general
coding system methods might need to be created.
5. important: we need a way of specifying how detecting works
when we have more than one coding system. we might need more
than a single priority list. need to think about this.
-- Unicode files beginning with the BOM are not recognized as such.
we need to fix this; but to make things sensible, we really need
to add the idea of different levels of confidence regarding
what's detected. otherwise, Unicode says "yes this is me" but
others higher up do too. in the process we should probably
finish abstracting the detection system and fix up some
stupidities in it.
-- When writing a file, we need error detection; otherwise somebody
will create a Unicode file without realizing the coding system
of the buffer is Raw, and then lose all the non-ASCII/Latin-1
text when it's written out. We need two levels:
1. first, a "safe-charset" level that checks before any actual
encoding to see if all characters in the document can safely
be represented using the given coding system. FSF has a
"safe-charset" property of coding systems, but it's stupid
because this information can be automatically derived from
the coding system, at least the vast majority of the time.
What we need is some sort of
where everything on it can be checked for safe charsets and
then the user given a list of possibilities. When the user
does "save with specified encoding", they should see the same
precedence list. Again like with other precedence lists,
there's also a global one, and presumably all coding systems
not on other list get appended to the end (and perhaps not
checked at all when doing safe-checking?). safe-checking
should work something like this: compile a list of all
charsets used in the buffer, along with a count of chars
used. that way, "slightly unsafe" charsets can perhaps be
presented at the end, which will lose only a few characters
and are perhaps what the users were looking for.
2. when actually writing out, we need error checking in case an
individual char in a charset can't be written even though the
charsets are safe. again, the user gets the choice of other
reasonable coding systems.
3. same thing (error checking, list of alternatives, etc.) needs
to happen when reading! all of this will be a lot of work!
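The "safe-charset" pre-check from item 1 above could look roughly like the Python sketch below. It is only an approximation: Python has no notion of Mule charsets, so it counts individual characters rather than charsets, and it uses Python's own codec names as stand-ins for coding systems. The function name and the `max_loss' cutoff are assumptions.

```python
from collections import Counter


def rank_coding_systems(text: str, precedence, max_loss: int = 3):
    """Split a precedence list into safe and "slightly unsafe" candidates."""
    counts = Counter(text)          # each distinct char with its use count
    safe, slightly_unsafe = [], []
    for name in precedence:
        lost = 0
        for ch, n in counts.items():
            try:
                ch.encode(name)
            except UnicodeEncodeError:
                lost += n           # every occurrence of ch would be lost
        if lost == 0:
            safe.append(name)
        elif lost <= max_loss:
            slightly_unsafe.append((lost, name))
    # Fewest lost characters first; these are offered after the safe ones.
    slightly_unsafe.sort(key=lambda pair: pair[0])
    return safe, slightly_unsafe


text = "na\u00efve \u65e5\u672c"
print(rank_coding_systems(text, ["ascii", "latin-1", "utf-8"]))
```

Counting once per distinct character keeps the check cheap on large buffers, and the loss counts give the ordering for the "slightly unsafe" entries presented at the end of the list.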
Announcement, August 20, 2001:
I'm looking for testers. There is a complete and fast implementation
in C of Unicode conversion, translations for almost all of the
standardly-defined charsets that load up automatically and
instantaneously at runtime, coding systems supporting the common
external representations of Unicode [utf-16, ucs-4, utf-8,
little-endian versions of utf-16 and ucs-4; utf-7 is sitting there
with aborts where the coding routines should go, just waiting for
somebody to implement], and a nice set of primitives for translating
characters<->codepoints and setting the priority lists used to control
It's so far hooked into one place: the Windows IME. Currently I can
select the Japanese IME from the thing on my tray pad in the lower
right corner of the screen, and type Japanese into XEmacs, and you get
Japanese in XEmacs -- regardless of whether you set either your
current or global system locale to Japanese, and regardless of whether
you set your XEmacs lang env as Japanese. This should work for many
other languages, too -- Cyrillic, Chinese either Traditional or
Simplified, and many others, but YMMV. There may be some lurking
bugs (hardly surprising for something so raw).
To get at this, checkout using `ben-mule-21-5', NOT the simpler
`mule-21-5'. For example:
cvs -d :pserver:firstname.lastname@example.org:/usr/CVSroot checkout -r ben-mule-21-5 xemacs
or you get the idea. the `-r ben-mule-21-5' is important.
I keep track of my progress in a file called README.ben-mule-21-5 in
the root directory of the source tree.
WARNING: Pdump might not work. Will be fixed rsn.
August 20, 2001:
-- still need to sort out demand loading, binary format, etc. figure
out what the goals are and how we're going to achieve them. for
the moment let's just say that running XEmacs in a directory with
Japanese or other weird characters in the name is likely to cause
problems under MS Windows, but once XEmacs is initialized (and
before processing init files), all Unicode support is there.
-- wrote the size computation routines, although not yet tested.
-- lots more abstraction of coding systems; almost done.
-- UNICODE WORKS!!!!!
August 19, 2001:
Still needed on the Unicode support:
-- demand loading: load the Unicode table data the first time a
conversion needs to be done.
-- maybe: table size computation: figure out how big the in-memory
tables actually are.
-- maybe: create a space-efficient binary format for the data, and a
way to dump out an existing charset's data into this binary format.
it should allow for many such groups of data to be appended
together in one file, such that you can just append the new data
onto the end and not have to go back and modify anything
previously. (like how tar archives work, and how the UDF format for
CD-R's and CD-RW's works.)
-- maybe: figure out how to be able to access the Unicode tables at
init_intl() time, before we know how to get at data-directory; that
way we can handle the need for unicode conversions that come up
very early, for example if XEmacs is run from a directory
containing Japanese in it. Presumably we'd want to generalize the
stuff in pdump.c that deals with the dumper file, so that it can
handle other files -- putting the file either in the directory of
the executable or in a resource, maybe actually attached to the
pdump file itself -- or maybe we just dump the data into the actual
executable. With pdump we could extend pdump to allow for data
that's in the pdump file but not actually mapped at startup,
separate from the data that does get mapped -- and then at runtime
the pointer gets restored not with a real pointer but an offset
into the file; another pdump call and we get some way to access the
data. (tricky because it might be in a resource, not a file. we
might have to just tell pdump to mmap or whatever the data in, and
then tell pdump to release it.)
-- fix multibyte to use unicode. at first, just reverse
mswindows-multibyte-to-unicode to be unicode-to-multibyte; later
implement something in chain to allow for reversal, for declaring
the ends of the coding systems, etc.
-- actually make sure that the IME stuff is working!!!
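The appendable binary format described in the list above could look roughly like the following. This is only a sketch under stated assumptions: the record layout, the header fields, and the names `record_header', `append_record' and `find_record' are hypothetical, and the header is written in host byte order, which a real format would have to pin down. The point it illustrates is that no record refers to any other, so new data goes on the end and nothing earlier is modified.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record: this header, then name_len bytes of charset name,
   then data_len bytes of table data. */
struct record_header {
  uint32_t name_len;
  uint32_t data_len;
};

/* Append one record at the end of f.  Returns 0 on success, -1 on error. */
int
append_record (FILE *f, const char *name, const void *data, uint32_t data_len)
{
  struct record_header h;
  h.name_len = (uint32_t) strlen (name);
  h.data_len = data_len;
  if (fseek (f, 0, SEEK_END) != 0) return -1;
  if (fwrite (&h, sizeof h, 1, f) != 1) return -1;
  if (h.name_len && fwrite (name, 1, h.name_len, f) != h.name_len) return -1;
  if (data_len && fwrite (data, 1, data_len, f) != data_len) return -1;
  return 0;
}

/* Scan from the start for the first record called name and copy its data
   into buf.  Returns the data length, or -1 if the record is missing,
   does not fit in buf, or the file is damaged.  Names of 64 bytes or
   more are never matched in this sketch. */
long
find_record (FILE *f, const char *name, void *buf, uint32_t buf_len)
{
  struct record_header h;
  char rec_name[64];
  size_t want = strlen (name);

  if (fseek (f, 0, SEEK_SET) != 0) return -1;
  while (fread (&h, sizeof h, 1, f) == 1)
    {
      int match = 0;
      if (h.name_len == want && want < sizeof rec_name)
        {
          if (fread (rec_name, 1, h.name_len, f) != h.name_len) return -1;
          match = memcmp (rec_name, name, want) == 0;
        }
      else if (fseek (f, (long) h.name_len, SEEK_CUR) != 0)
        return -1;
      if (match)
        {
          if (h.data_len > buf_len) return -1;
          if (h.data_len && fread (buf, 1, h.data_len, f) != h.data_len)
            return -1;
          return (long) h.data_len;
        }
      if (fseek (f, (long) h.data_len, SEEK_CUR) != 0) return -1;
    }
  return -1;
}
```

Because lookup is a linear scan over headers, this also fits demand loading: only the record actually needed is read into memory.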
Other things before announcing:
-- change so that the Unicode tables are not pdumped. This means we
need to free any table data out there. Make sure that pdump
compiles and try to finish the pretty-much-already-done stuff
already with XD_STRUCT_ARRAY and dynamic size computation; just
need to see what's going on with LO_LINK.
August 14, 2001:
To do a diff between this workspace and the mainline, use the most recent sync tags, currently:
cvs diff -r main-branch-ben-mule-21-5-aug-11-2001-sync -r ben-mule-21-5-post-aug-11-2001-sync
Unicode support is important for supporting many languages under
Windows, such as Cyrillic, without resorting to translation tables for
particular Windows-specific code pages. Internally, all characters in
Windows can be represented in two encodings: code pages and Unicode.
With Unicode support, we can seamlessly support all Windows
characters. Currently, the test of progress toward Unicode support is
whether IME input works properly, since that input is converted from Unicode.
Unicode support also requires that the various Windows API's be
"Unicode-encapsulated", so that they automatically call the ANSI or
Unicode version of the API call appropriately and handle the size
differences in structures. What this means is:
-- first, note that Windows already provides a sort of encapsulation
of all API's that deal with text. All such API's are underlyingly
provided in two versions, with an A or W suffix (ANSI or "wide"
i.e. Unicode), and the compile-time constant UNICODE controls which
is selected by the unsuffixed API. Same thing happens with
structures. Unfortunately, this is compile-time only, not
run-time, so not sufficient. (Creating the necessary run-time
encapsulation is not conceptually difficult, but very time-consuming to
write. It adds no significant overhead, and the only reason it's
not standard in Windows is conscious marketing attempts by
Microsoft to cripple Windows 95. FUCK MICROSOFT! They even
describe in a KnowledgeBase article exactly how to create such an
API [although we don't exactly follow their procedure], and point
out its usefulness; the procedure is also described more generally
in Nadine Kano's book on Win32 internationalization -- written SIX
YEARS AGO! Obviously Microsoft has such an API available
-- what we do is provide an encapsulation of each standard Windows API
call that is split into A and W versions. current theory is to
avoid all preprocessor games; so we name the function with a prefix
-- "qxe" currently -- and require callers to use the prefixed name.
Callers need to explicitly use the W version of all structures, and
convert text themselves using Qmswindows_tstr. the qxe
encapsulated version will automatically call the appropriate A or W
version depending on whether we're running on 9x or NT, and copy
data between W and A versions of the structures as necessary.
-- We require the caller to handle the actual translation of text to
avoid possible overflow when dealing with fixed-size Windows
structures. There are no such problems when copying data between
the A and W versions because ANSI text is never larger than its
equivalent Unicode representation.
-- We allow for incremental creation of the encapsulated routines by
using the coding system Qmswindows_tstr_notyet. This is an alias
for Qmswindows_multibyte, i.e. it always converts to ANSI; but it
indicates that it will be changed to Qmswindows_tstr when we have a
qxe version of the API call that the data is being passed to and
change the code to use the new function.
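The run-time dispatch described in the list above can be sketched portably as follows. Everything here is a stand-in: `demo_SetTitleA' and `demo_SetTitleW' are hypothetical functions that only record their argument (they are not real Win32 calls), and `running_on_nt' and `qxe_demo_SetTitle' are illustrative names. The sketch assumes, as the text says, that the caller has already converted the text to the right form before calling the prefixed function.

```c
#include <string.h>
#include <wchar.h>

/* Stand-ins for an A/W pair of Windows calls.  Hypothetical; they only
   record the last string passed in. */
char    last_title_a[64];
wchar_t last_title_w[64];

int
demo_SetTitleA (const char *s)
{
  strncpy (last_title_a, s, 63);
  last_title_a[63] = '\0';
  return 1;
}

int
demo_SetTitleW (const wchar_t *s)
{
  wcsncpy (last_title_w, s, 63);
  last_title_w[63] = L'\0';
  return 1;
}

/* Set once at startup in a real program: nonzero on NT, zero on 9x. */
int running_on_nt;

/* Encapsulated call.  tstr is text the caller has already converted:
   wchar_t data when running_on_nt is set, ANSI bytes otherwise.  The
   choice between the A and W versions is made here, at run time,
   rather than by the compile-time UNICODE constant. */
int
qxe_demo_SetTitle (const void *tstr)
{
  if (running_on_nt)
    return demo_SetTitleW ((const wchar_t *) tstr);
  return demo_SetTitleA ((const char *) tstr);
}
```

A wrapper for a call that takes a structure would additionally copy fields between the W and A versions of that structure around the dispatch; that part is not shown.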
Besides creating the encapsulation, the following needs to be done for
-- No actual translation tables are fed into XEmacs. We need to
provide glue code to read the tables in etc/unicode. See
etc/unicode/README for the interface to implement.
-- Fix pdump. The translation tables for Unicode characters function
as unions of structures with different numbers of indirection
levels, in order to be efficient. pdump doesn't yet support such
unions. charset.h has a general description of how the translation
tables work, and the pdump code has constants added for the new
required data types, and descriptions of how these should work.
-- ultimately, there's no end to additional work (composition, bidi
reordering, glyph shaping/ordering, etc.), but the above is enough
to get basic translation working.
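To make the pdump item above more concrete, here is a rough sketch of a translation table whose nodes are unions and whose depth varies per table. It is not the layout described in charset.h; the names `node', `to_charset_table', `table_store' and `table_lookup' are hypothetical, and 0 is assumed to mean "no mapping". A dumper walking such a table cannot tell from the type alone whether a node holds pointers or leaf values, which is the kind of union support the item says pdump lacks.

```c
#include <stdint.h>
#include <stdlib.h>

/* A node is either 256 child pointers or, at the last level, 256 values. */
union node {
  union node *child[256];
  unsigned short leaf[256];
};

struct to_charset_table {
  int levels;        /* 1..4: how many bytes of the code are indexed */
  union node *root;
};

/* Store value for code, allocating nodes as needed.  Returns 0 on
   success, -1 if code does not fit in the table or memory runs out.
   Relies on calloc'ed memory reading back as null pointers, which holds
   on mainstream platforms. */
int
table_store (struct to_charset_table *t, uint32_t code, unsigned short value)
{
  union node **slot = &t->root;
  if (t->levels < 4 && (code >> (8 * t->levels)) != 0)
    return -1;
  for (int i = t->levels - 1; ; i--)
    {
      if (!*slot && !(*slot = calloc (1, sizeof (union node))))
        return -1;
      if (i == 0)
        {
          (*slot)->leaf[code & 0xFF] = value;
          return 0;
        }
      slot = &(*slot)->child[(code >> (8 * i)) & 0xFF];
    }
}

/* Look up code; returns 0 when there is no mapping. */
unsigned short
table_lookup (const struct to_charset_table *t, uint32_t code)
{
  const union node *n = t->root;
  if (t->levels < 4 && (code >> (8 * t->levels)) != 0)
    return 0;
  for (int i = t->levels - 1; n && i > 0; i--)
    n = n->child[(code >> (8 * i)) & 0xFF];
  return n ? n->leaf[code & 0xFF] : 0;
}
```

A one-level table costs a single array index per lookup, while a charset spread across the code space pays for extra levels only where it has data.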
Merging this workspace into the trunk requires some work. ChangeLogs
have not yet been created. Also, there is a lot of additional code in
this workspace other than just Windows and Unicode stuff. Some of the
changes have been somewhat disruptive to the code base, in particular:
-- the code that handles the details of processing multilingual text
has been consolidated to make it easier to extend it. it has been
yanked out of various files (buffer.h, mule-charset.h, lisp.h,
insdel.c, fns.c, file-coding.c, etc.) and put into text.c and
text.h. mule-charset.h has also been renamed charset.h. all long
comments concerning the representations and their processing have
been consolidated into text.c.
-- nt/config.h has been eliminated and everything in it merged into
config.h.in and s/windowsnt.h. see config.h.in for more info.
-- s/windowsnt.h has been completely rewritten, and s/cygwin32.h and
s/mingw32.h have been largely rewritten. tons of dead weight has
been removed, and stuff common to more than one file has been
isolated into s/win32-common.h and s/win32-native.h, similar to
what's already done for usg variants.
-- large amounts of code throughout the code base have been Mule-ized,
not just Windows code.
-- file-coding.c/.h have been largely rewritten (although still mostly
syncable); see below.
June 26, 2001:
this contains all the mule work i've been doing. this includes mostly
work done to get mule working under ms windows, but in the process
i've [of course] fixed a whole lot of other things as well, mostly
mule issues. the specifics:
- it compiles and runs under windows and should basically work. the
stuff remaining to do is (a) improved unicode support (see below)
and (b) smarter handling of keyboard layouts. in particular, it
should (1) set the right keyboard layout when you change your
language environment; (2) optionally (a user var) set the
appropriate keyboard layout as you move the cursor into text in a
- i added a bunch of code to better support OS locales. it tries to
notice your locale at startup and set the language environment
accordingly (this more or less works), and call setlocale() and set
LANG when you change the language environment (may or may not work).
- major rewriting of file-coding. it's mostly abstracted into coding
systems that are defined by methods (similar to devices and
specifiers), with the ultimate aim being to allow non-i18n coding
systems such as gzip. there is a "chain" coding system that allows
multiple coding systems to be chained together. (it doesn't yet
have the concept that either end of a coding system can be bytes or
chars; this needs to be added.)
- unicode support. very raw. a few days ago i wrote a complete and
efficient implementation of unicode translation. it should be very
fast, and fairly memory-efficient in its tables. it allows for
charset priority lists, which should be language-environment
specific (but i haven't yet written the glue code). it works in
preliminary testing, but obviously needs more testing and work.
as of yet there is no translation data added for the standard charsets.
the tables are in etc/unicode, and all we need is a bit of glue code
to process them. see etc/unicode/README for the interface to implement.
- support for unicode in windows is partly there. this will work even
on windows 95. the basic model is implemented but it needs finishing
- there is a preliminary implementation of windows ime support courtesy
- if you want to get cyrillic working under windows (it appears to "work"
but the wrong chars currently appear), the best way is to add unicode
support for iso-8859-5 and use it in redisplay-msw.c. we are already
passing unicode codepoints to the text-draw routine (ExtTextOutW).
(ExtTextOutW and GetTextExtentPoint32W are implemented on both 95 and NT.)
- i fixed the iso2022 handling so it will correctly read in files
containing unknown charsets, creating a "temporary" charset which
can later be overwritten by the real charset when it's defined.
this allows iso2022 elisp files with literals in strange languages
to compile correctly under mule. i also added a hack that will
correctly read in and write out the emacs-specific "composition"
escape sequences, i.e. ESC 0 through ESC 4. this means that my
workspace correctly compiles the new file devanagari.el that i added
- i copied the remaining language-specific files from fsf. i made
some minor changes in certain cases but for the most part the stuff
was just copied and may not work.
- i fixed post-read-conversion in coding systems to follow fsf
conventions. (i also support our convention, for the moment. a
kludge, of course.)
- make-coding-system accepts (but ignores) the additional properties
present in the fsf version, for compatibility.
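As an illustration of the file-coding item above (coding systems defined by methods, plus a "chain" type), here is a minimal sketch. The struct layout and all names (`coding_methods', `coding_system', `demo_chain', and so on) are hypothetical and much smaller than anything in file-coding.c. The rot13 stage is a toy byte-to-byte transform standing in for a non-i18n coding system such as gzip; it assumes ASCII. Like the note in the text, the sketch treats both ends of every stage as bytes.

```c
#include <stddef.h>
#include <string.h>

struct coding_system;

/* Hypothetical method table; a real one would have many more methods. */
struct coding_methods {
  const char *name;
  /* Convert n bytes from in to out (capacity cap).  Returns the number
     of bytes written, or (size_t) -1 if out is too small. */
  size_t (*encode) (const struct coding_system *cs,
                    const unsigned char *in, size_t n,
                    unsigned char *out, size_t cap);
};

struct coding_system {
  const struct coding_methods *methods;
  const struct coding_system *const *stages; /* chain type only */
  size_t n_stages;                           /* chain type only */
};

/* LF -> CRLF, standing in for an ordinary text coding system. */
static size_t
crlf_encode (const struct coding_system *cs, const unsigned char *in,
             size_t n, unsigned char *out, size_t cap)
{
  size_t o = 0;
  (void) cs;
  for (size_t i = 0; i < n; i++)
    {
      if (in[i] == '\n')
        {
          if (o == cap) return (size_t) -1;
          out[o++] = '\r';
        }
      if (o == cap) return (size_t) -1;
      out[o++] = in[i];
    }
  return o;
}

/* rot13 on ASCII letters: a toy non-i18n byte transform. */
static size_t
rot13_encode (const struct coding_system *cs, const unsigned char *in,
              size_t n, unsigned char *out, size_t cap)
{
  (void) cs;
  if (n > cap) return (size_t) -1;
  for (size_t i = 0; i < n; i++)
    {
      unsigned char c = in[i];
      if (c >= 'a' && c <= 'z') c = 'a' + (c - 'a' + 13) % 26;
      else if (c >= 'A' && c <= 'Z') c = 'A' + (c - 'A' + 13) % 26;
      out[i] = c;
    }
  return n;
}

enum { SCRATCH = 256 };

/* Chain: feed the output of each stage to the next, in order. */
static size_t
chain_encode (const struct coding_system *cs, const unsigned char *in,
              size_t n, unsigned char *out, size_t cap)
{
  unsigned char a[SCRATCH], b[SCRATCH];
  unsigned char *src = a, *dst = b;
  if (n > SCRATCH) return (size_t) -1;
  memcpy (a, in, n);
  for (size_t i = 0; i < cs->n_stages; i++)
    {
      const struct coding_system *stage = cs->stages[i];
      n = stage->methods->encode (stage, src, n, dst, SCRATCH);
      if (n == (size_t) -1) return n;
      unsigned char *t = src; src = dst; dst = t;
    }
  if (n > cap) return (size_t) -1;
  memcpy (out, src, n);
  return n;
}

static const struct coding_methods crlf_methods  = { "crlf",  crlf_encode };
static const struct coding_methods rot13_methods = { "rot13", rot13_encode };
static const struct coding_methods chain_methods = { "chain", chain_encode };

const struct coding_system crlf_cs  = { &crlf_methods,  NULL, 0 };
const struct coding_system rot13_cs = { &rot13_methods, NULL, 0 };
const struct coding_system *const demo_stages[] = { &crlf_cs, &rot13_cs };
const struct coding_system demo_chain = { &chain_methods, demo_stages, 2 };
```

Callers only ever go through `methods->encode', so a chain is used exactly like a single coding system, and a chain can itself appear as a stage of another chain.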