MIGRATION FROM 0.1.X TO 0.2.X

0.2.x breaks interoperability with 0.1.x in several ways, in order to support
more use cases and to provide more storage capacity.

1. Binary Data Changes

1.1 All Trie Data in Single File

A trie is no longer split into '{trie-name}.sbm', '{trie-name}.br' and
'{trie-name}.tl'. All parts are now stored in a single file, '{trie-name}.tri'.

Note, however, that a '{trie-name}.abm' file (a renamed version of
'{trie-name}.sbm' after Unicode support) is still needed when the trie is first
created. Once created, '{trie-name}.tri' incorporates the alphabet map data,
and '{trie-name}.abm' is no longer required. If it exists, it is ignored.

1.2 32-Bit Node Index

To accommodate larger word lists, trie node indices are now 32 bits instead of
16 bits. This widens the index range by a factor of roughly 65,536 (2^16)
compared to the old format. As a result, data files generally double in size
when migrated from the old format, but they can hold far more entries.

In addition, tail block lengths are now 16 bits instead of 8 bits, which makes
it possible to store longer suffixes for dictionaries of extremely long words.
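
The exact capacity gain depends on how the index fields are defined. As a rough
sketch, assuming indices are stored as signed integers (an assumption made for
illustration, not a definition taken from the datrie sources), the index ranges
compare as follows. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Largest index representable in a signed field of the given bit width.
 * Assumption: indices are signed; this is illustrative only. */
static int64_t max_signed_index(unsigned bits)
{
    assert(bits >= 2 && bits <= 63);
    return ((int64_t)1 << (bits - 1)) - 1;
}

/* How many times larger the new index range is than the old one
 * (integer division, so the result is rounded down). */
static int64_t index_range_ratio(unsigned old_bits, unsigned new_bits)
{
    return max_signed_index(new_bits) / max_signed_index(old_bits);
}
```

Under that assumption, index_range_ratio(16, 32) comes out close to 2^16. The
same widening applies to the tail block length when going from 8 to 16 bits.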

1.3 No Backward Compatibility

To keep the code simple, 0.2.x does not read or write old-format files. If you
prefer the old format, just stay with the old version. If you want the benefits
of the new version, you can migrate your old data by first dumping your
dictionary into a text file with the 0.1.x trietool and then creating the new
dictionary from the dumped word list. If you already have the word list, things
are a lot easier: just create the dictionary with the new trietool.

Data Migration Steps:

a. If you have the word list source, skip to the next step. Otherwise, you
   can dump the old data with the 0.1.x trietool:

     $ trietool {trie-name} list > words.lst

b. Prepare '{trie-name}.abm', listing the ranges of characters used in the
   word list in terms of Unicode values. For example, for an English and Thai
   dictionary:

     [0x0041,0x005a]
     [0x0061,0x007a]
     [0x0e01,0x0e3a]
     [0x0e40,0x0e4e]

c. Generate the new trie with the 0.2.x tool, trietool-0.2. For example:

     $ trietool-0.2 {trie-name} add-list -e TIS-620 words.lst

   In this example, '-e TIS-620' indicates that the 'words.lst' file
   contains TIS-620 encoded text, which is the most likely case for word lists
   dumped from an old trie that used 8-bit Thai character codes as the key
   encoding. Replace it with your old encoding as necessary, such as
   ISO-8859-1. If the '-e' option is omitted, the current locale encoding is
   assumed. See the trietool-0.2 man page for details.
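
To illustrate the range notation used in step b, here is a minimal sketch that
parses one line of the form '[0x0041,0x005a]' into a pair of Unicode values.
It assumes one range per line, as in the example above. The CharRange type and
parse_abm_range() are hypothetical names, and this is not the parser used by
trietool-0.2.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper type: one inclusive range of Unicode values. */
typedef struct {
    uint32_t begin;
    uint32_t end;
} CharRange;

/* Parse a line such as "[0x0041,0x005a]".
 * Returns 1 and fills *out on success; returns 0 if the line is malformed
 * or if the range is reversed (begin greater than end). */
static int parse_abm_range(const char *line, CharRange *out)
{
    unsigned int b, e;
    char closing;

    assert(line != NULL && out != NULL);
    /* %x accepts an optional "0x" prefix; %c captures the closing bracket. */
    if (sscanf(line, " [ %x , %x %c", &b, &e, &closing) != 3 || closing != ']')
        return 0;
    if (b > e)
        return 0;
    out->begin = b;
    out->end = e;
    return 1;
}
```

Checking each line this way before running trietool-0.2 can catch typos such as
a missing bracket or a reversed range early.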

2. API Changes

2.1 Non-File Trie Usage

In datrie 0.1.x, every trie was associated with a set of files. Now, not only
is this reduced to a single file, but using no file at all is also possible.
That is, a new trie can be created in memory, have words added, removed and
queried, and then be disposed of without writing data to any file. Saving to a
file is still possible.

Scenario 1: Loading trie from file, using it read-only.
  1a. Open trie with trie_new_from_file(path).
  1b. Use it.
  1c. On exit:
      - Close it with trie_free().

Scenario 2: Loading trie from file, updating file when finished.
  2a. Open trie with trie_new_from_file(path).
  2b. Use/update it.
  2c. On exit:
      - If trie_is_dirty(), then trie_save().
      - Close it with trie_free().

Scenario 3: Create a new trie, saving it when finished.
  3a. Prepare an alphabet map:
      - Create new alphabet map with alpha_map_new().
      - Add ranges with alpha_map_add_range().
  3b. Create new trie with trie_new(alpha_map).
  3c. Free the alphabet map with alpha_map_free().
  3d. Use/update the trie.
  3e. On exit:
      - If trie_is_dirty(), then trie_save().
      - Close the trie with trie_free().

Scenario 4: Create temporary trie, disposing it when finished.
  4a. Prepare an alphabet map:
      - Create new alphabet map with alpha_map_new().
      - Add ranges with alpha_map_add_range().
  4b. Create new trie with trie_new(alpha_map).
  4c. Free the alphabet map with alpha_map_free().
  4d. Use/update the trie.
  4e. On exit:
      - Close the trie with trie_free().

2.2 No More SBTrie

In datrie 0.1.x, SBTrie provided a wrapper around the Trie implementation,
converting between real character codes and trie internal codes. This was for
compactness, as a contiguous character code range allows more compact sparse
table allocation, while the real alphabet set need not be contiguous. In
datrie 0.2.x, however, this mapping feature has been merged into the Trie
class to reduce call layers. So, there is no SBTrie any more. You can call Trie
directly in the same way you called SBTrie in 0.1.x.

2.3 Characters are Now Unicode

datrie was previously planned to support multiple kinds of character encodings,
with single-byte encoding as the only implementation available for the time
being.

However, as there have been many requests for Unicode support, Unicode seems to
be the most useful choice, and all other encodings can be converted into it.

Furthermore, as datrie is mostly used in a program's critical path, having too
many layers can create a bottleneck. So, only Unicode is accepted in this
version. It is now the application's duty to convert its keys into Unicode
before passing them to datrie. This also leaves the application free to apply
any kind of caching it finds useful.
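
As an illustration of such application-side conversion, the sketch below maps
a TIS-620 encoded key to Unicode code points. It relies on the standard
TIS-620 layout, where bytes 0xA1-0xDA and 0xDF-0xFB correspond to U+0E01-U+0E3A
and U+0E3F-U+0E5B respectively. The function names are hypothetical, and
uint32_t stands in for the library's AlphaChar type (an assumption made to keep
the sketch free of dependencies).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Map one TIS-620 byte to a Unicode code point.
 * Returns 0 for bytes that TIS-620 leaves undefined. */
static uint32_t tis620_to_unicode(unsigned char c)
{
    if (c < 0x80)
        return c;                          /* ASCII passes through */
    if ((c >= 0xA1 && c <= 0xDA) || (c >= 0xDF && c <= 0xFB))
        return 0x0E00u + (c - 0xA0u);      /* Thai block */
    return 0;
}

/* Convert a NUL-terminated TIS-620 string into a 0-terminated array of
 * code points. Returns the number of code points written (excluding the
 * terminator), or -1 on an undefined byte or insufficient space. */
static long tis620_key_to_unicode(const char *src, uint32_t *dst,
                                  size_t dst_len)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL && dst_len > 0);
    for (; *src != '\0'; src++) {
        uint32_t cp = tis620_to_unicode((unsigned char)*src);
        if (cp == 0 || n + 1 >= dst_len)
            return -1;
        dst[n++] = cp;
    }
    dst[n] = 0;
    return (long)n;
}
```

An application that looks up the same keys repeatedly could keep the converted
arrays around instead of converting on every call.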

2.4 New Public APIs for Alphabet Map

As AlphaMap (alphabet map) is now necessary for creating a new empty trie, the
APIs for manipulating it are now public. See <datrie/alpha-map.h> for the
details.

2.5 Extensions to TrieState

trie_state_copy()

Performance profiling showed that allocating and freeing TrieState takes up a
noticeable amount of CPU time, so reusing an existing TrieState where possible
helps. This function copies TrieState data into an existing state, as a cheaper
alternative to trie_state_clone().

trie_state_is_single()

Sometimes, checking whether a TrieState is a leaf state is too expensive for a
program's critical path. That check must determine both whether the state is in
a non-branching path, that is, whether it is in a suffix node, and whether it
can be walked by a terminator. When a program only needs to check the former
and not the latter, this function is available.

3. Changes to TrieTool

3.1 Renaming

To allow co-existence with 0.1.x trietool, 0.2.x trietool is named
trietool-0.2.

3.2 '*.abm' Instead of '*.sbm'

As SBTrie has been eliminated in datrie 0.2.x, the corresponding '*.sbm'
(single-byte map) input file is also obsolete. It is now renamed to '*.abm'
(alphabet map). Its format is also redefined to be Unicode-based: all alphabet
character ranges are defined in Unicode.

In addition, the '*.abm' file is required only once, at trie creation time. It
is not needed at deployment, as the alphabet map is already included in the
single trie file.

3.3 Encoding Conversion Support

As datrie is now Unicode-based, conversion from other encodings can be useful.
This is possible for the word list operations, namely add-list and delete-list,
through the additional '-e {enc}' or '--encoding {enc}' option. This option
specifies the character encoding of the word list file, and trietool-0.2
converts the contents to Unicode on the fly.