Copyright (C) 2008 Matteo Vescovi <matteo.vescovi@yahoo.co.uk>
___________________
The Presage project
~~~~~~~~~~~~~~~~~~~

TODO list
---------


GUI apps:
* qprompter
** integrate into build system
* gprompter
** gray the redo and undo menu items in and out as they become available
** toolbar icon size
** autocomp max height
** would be nice to have a status bar showing the KSR (keystroke savings rate)

Architectural restructure:
- n-gram language model database format and database connector

  The current database format stores the full word strings in every
  n-gram record, i.e. for the "the quick brown" fragment we'll have

  1-gram: <word, count>
          <the, 20>

  2-gram: <word_1, word, count>
          <the, quick, 10>

  3-gram: <word_2, word_1, word, count>
          <the, quick, brown, 1>

  A possibly more time-efficient and space-efficient approach to
  structuring the language model involves having n-gram records refer
  to (n-1)-gram records instead of repeating the word strings, i.e.:

  1-gram: <uid, word, count>
          <1023, the, 20>

  2-gram: <uid, 1-gram, word, count>
          <2204, 1023, quick, 10>

  3-gram: <uid, 2-gram, word, count>
          <3452, 2204, brown, 1>

  To build up the full 3-gram string "the quick brown", the references
  to the 2-gram and 1-gram need to be walked. However, the predictive
  algorithm needs to look at the counts for all k-grams where k is in
  [1, n], so this would not be an additional time cost. The database
  size would shrink, as the n-gram tables would no longer store
  repetitions of the words.

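  A minimal C++ sketch of the reference-based layout, assuming an
  in-memory map stands in for the database tables. NgramRecord,
  NgramTable, expand() and example_table() are hypothetical names, not
  presage code, and a parent uid of 0 is an assumed convention for "no
  parent". It shows that walking the references yields both the words
  and the count of every k-gram along the chain.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical record layout: every n-gram row stores only its last word
// plus the uid of the (n-1)-gram row it extends (0 means "no parent").
struct NgramRecord {
    int parent;        // uid of the (n-1)-gram record, 0 for a 1-gram
    std::string word;  // last word of the n-gram
    int count;         // occurrences of the full n-gram
};

// Hypothetical in-memory stand-in for the database tables, keyed by uid.
typedef std::map<int, NgramRecord> NgramTable;

// Walk the parent references from an n-gram record down to its 1-gram.
// Returns the words in reading order; counts[k-1] receives the count of
// the k-gram met along the chain.
std::vector<std::string> expand(const NgramTable& table, int uid,
                                std::vector<int>& counts)
{
    std::vector<std::string> words;
    counts.clear();
    while (uid != 0) {
        NgramTable::const_iterator it = table.find(uid);
        assert(it != table.end());  // a dangling uid means a corrupt table
        words.insert(words.begin(), it->second.word);
        counts.insert(counts.begin(), it->second.count);
        uid = it->second.parent;
    }
    return words;
}

// Three records that together spell "the quick brown".
NgramTable example_table()
{
    NgramTable table;
    table[1023] = NgramRecord{0, "the", 20};
    table[2204] = NgramRecord{1023, "quick", 10};
    table[3452] = NgramRecord{2204, "brown", 1};
    return table;
}
```

  Calling expand(example_table(), 3452, counts) walks 3452, 2204 and
  1023 in turn, so a single traversal collects the 3-gram, 2-gram and
  1-gram counts.
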
- selector
  should be a class similar to the current PredictorActivator, i.e. a
  class that invokes other classes' methods to perform work.
  The current Selector's functionality should be broken up into Filter
  objects, i.e. an abstract Filter class and implementations of various
  filters (repetition filter, greedy filter, etc.)

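  One way this split could look, as a sketch only. The Filter
  interface, the apply() signature and the behaviour chosen for the
  repetition filter (drop tokens that were already offered) are
  assumptions made for illustration, not the existing presage classes.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical interface: a Filter takes candidate tokens and returns
// the ones that survive.
class Filter {
public:
    virtual ~Filter() {}
    virtual std::vector<std::string>
    apply(const std::vector<std::string>& candidates) = 0;
};

// Assumed behaviour: drop any token already seen, either earlier in the
// same list or in a previous call.
class RepetitionFilter : public Filter {
public:
    virtual std::vector<std::string>
    apply(const std::vector<std::string>& candidates)
    {
        std::vector<std::string> kept;
        for (std::size_t i = 0; i < candidates.size(); ++i) {
            // insert() reports whether the token was new to the set
            if (seen_.insert(candidates[i]).second) {
                kept.push_back(candidates[i]);
            }
        }
        return kept;
    }
private:
    std::set<std::string> seen_;
};

// Selector reduced to a coordinator: it holds no filtering logic and
// runs the candidates through each registered filter in turn.
class Selector {
public:
    void add_filter(Filter* filter)
    {
        assert(filter != 0);
        filters_.push_back(filter);
    }

    std::vector<std::string> select(std::vector<std::string> candidates)
    {
        for (std::size_t i = 0; i < filters_.size(); ++i) {
            candidates = filters_[i]->apply(candidates);
        }
        return candidates;
    }
private:
    std::vector<Filter*> filters_;  // not owned
};
```

  A greedy filter would be another Filter subclass added to the same
  Selector, so each filtering rule can be tested on its own.
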
- combiner
  clean up the mess that is our current Predictor implementation,
  particularly with regard to the Combiner handling and
  implementation. Consider making Combiner a concrete class that
  uses different CombinationStrategy objects to combine
  predictions. The Combiner object would know how to retrieve its
  config values and which Strategy to create and use.

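  A sketch of the strategy-based Combiner under stated assumptions: a
  prediction is reduced to a token-to-probability map, the strategy name
  passed to the constructor stands in for the config value, and
  AverageStrategy and MaxStrategy are hypothetical examples rather than
  strategies presage is known to ship.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical simplified prediction: token -> probability.
typedef std::map<std::string, double> Prediction;

// Strategy interface: how to merge several predictions into one.
class CombinationStrategy {
public:
    virtual ~CombinationStrategy() {}
    virtual Prediction combine(const std::vector<Prediction>& inputs) const = 0;
};

// Averages each token's probability over all inputs; a token missing
// from an input counts as probability 0 there.
class AverageStrategy : public CombinationStrategy {
public:
    virtual Prediction combine(const std::vector<Prediction>& inputs) const
    {
        Prediction result;
        for (std::size_t i = 0; i < inputs.size(); ++i) {
            for (Prediction::const_iterator it = inputs[i].begin();
                 it != inputs[i].end(); ++it) {
                result[it->first] += it->second / inputs.size();
            }
        }
        return result;
    }
};

// Keeps, for each token, the highest probability any input gave it.
class MaxStrategy : public CombinationStrategy {
public:
    virtual Prediction combine(const std::vector<Prediction>& inputs) const
    {
        Prediction result;
        for (std::size_t i = 0; i < inputs.size(); ++i) {
            for (Prediction::const_iterator it = inputs[i].begin();
                 it != inputs[i].end(); ++it) {
                double& best = result[it->first];  // starts at 0.0
                if (it->second > best) best = it->second;
            }
        }
        return result;
    }
};

// Concrete Combiner: picks its strategy from a (config) name and
// delegates the actual combination to it.
class Combiner {
public:
    explicit Combiner(const std::string& strategy_name)
        : strategy_(make_strategy(strategy_name)) {}

    Prediction combine(const std::vector<Prediction>& inputs) const
    {
        assert(strategy_.get() != 0);
        return strategy_->combine(inputs);
    }
private:
    static CombinationStrategy* make_strategy(const std::string& name)
    {
        if (name == "max") return new MaxStrategy;
        return new AverageStrategy;  // assumed default
    }
    std::unique_ptr<CombinationStrategy> strategy_;
};
```

  Adding a strategy then means adding one subclass and one branch in
  make_strategy(); callers of Combiner::combine() are unaffected.
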
- registry [DONE]:

  Predictor class functionality should be split up. There should be
  one PluginRegistry class which holds the active plugins and whose
  interface consists of a call that returns an iterator to the
  plugins.

  Predictor would obtain an iterator from PluginRegistry and invoke
  the predict() method on each Plugin pointed to by the iterator.

  A new Learner class could invoke the learn() method on them when
  needed.

  This way, the reverse dependency that implementing learning causes
  between ContextTracker and Predictor would disappear, being
  replaced by a single dependency on the Registry and the introduction
  of a new Learner class (name still to be decided).

  The registry should eventually just be a simple wrapper around
  plump.


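  The dependency structure described for the registry can be sketched
  as follows. The method signatures, the begin()/end() pair used to hand
  out iterators, and the toy EchoPlugin are assumptions made to keep the
  example self-contained; they are not the actual presage interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical minimal plugin interface with the two operations named
// in the text.
class Plugin {
public:
    virtual ~Plugin() {}
    virtual std::vector<std::string> predict(const std::string& prefix) const = 0;
    virtual void learn(const std::string& word) = 0;
};

// Holds the active plugins; its only job is to hand out iterators.
class PluginRegistry {
public:
    typedef std::vector<Plugin*>::const_iterator Iterator;

    void add(Plugin* plugin)
    {
        assert(plugin != 0);
        plugins_.push_back(plugin);
    }
    Iterator begin() const { return plugins_.begin(); }
    Iterator end() const { return plugins_.end(); }
private:
    std::vector<Plugin*> plugins_;  // not owned
};

// Predictor depends on the registry only and calls predict() on each plugin.
class Predictor {
public:
    explicit Predictor(const PluginRegistry& registry) : registry_(registry) {}

    std::vector<std::string> predict(const std::string& prefix) const
    {
        std::vector<std::string> all;
        for (PluginRegistry::Iterator it = registry_.begin();
             it != registry_.end(); ++it) {
            std::vector<std::string> part = (*it)->predict(prefix);
            all.insert(all.end(), part.begin(), part.end());
        }
        return all;
    }
private:
    const PluginRegistry& registry_;
};

// Learner also depends on the registry only and calls learn() on each plugin.
class Learner {
public:
    explicit Learner(const PluginRegistry& registry) : registry_(registry) {}

    void learn(const std::string& word) const
    {
        for (PluginRegistry::Iterator it = registry_.begin();
             it != registry_.end(); ++it) {
            (*it)->learn(word);
        }
    }
private:
    const PluginRegistry& registry_;
};

// Toy plugin used only to exercise the sketch: it remembers learnt
// words and predicts those that start with the given prefix.
class EchoPlugin : public Plugin {
public:
    virtual std::vector<std::string> predict(const std::string& prefix) const
    {
        std::vector<std::string> matches;
        for (std::size_t i = 0; i < words_.size(); ++i) {
            if (words_[i].compare(0, prefix.size(), prefix) == 0) {
                matches.push_back(words_[i]);
            }
        }
        return matches;
    }
    virtual void learn(const std::string& word) { words_.push_back(word); }
private:
    std::vector<std::string> words_;
};
```

  Neither Predictor nor Learner refers to the other, so a context
  tracker could trigger learning through Learner without depending on
  Predictor.
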
Short term:
* Logger
  - implement logger level inheritance from parent module
  - SqliteDatabaseConnector callback: had to disable logging there
    because it is a static method; investigate how it can be re-enabled
* test performance with different n values in n-gram
* consider removing the following public methods from the Variable
  interface:
  . Variable(const std::vector<std::string>& variable);
  . size_t size() const;
* consider removing src/tools/ngram.* code
* smoothed n-gram predictor
  - is it possible to reduce calls to count() to improve performance?
* rationalise the location of user-specific and system data files and
  of config files
  - option to comply with XDG basedir spec for config files and data files
* add proper Unicode support
* determine whether to enable dictionary plugin by default
  (dictionary file?)
* rewrite strtoupper and strtolower utility functions to use a function
  pointer to do the individual char conversion
* add ContextTracker tests for control chars
* put everything inside the presage namespace
* write more integration tests
* write Combiner implementations (various combination strategies)
* add more tests, increase test coverage
* bug: validate the string passed to the sql_exec query function; an
  unsanitized string can cause security problems
* implement activation map predictive plugin

- try to improve reverseTokenizer::progress() accuracy
  currently it uses a delta of 0.7, should try to get it down to 0.3
- Class ContextTracker could initialize Tokenizer's members separator
  and blankspace in a member initializer list. Also, Tokenizer could
  take references to string instead of pointers.

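For the strtoupper/strtolower item in the list above, the rewrite
could take roughly this shape. It is a sketch that assumes the
utilities take and return std::string; convert_chars(), upper_char()
and lower_char() are hypothetical helper names.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Function pointer type for a single-character conversion.
typedef int (*CharConverter)(int);

// Thin wrappers so that we pass our own functions around rather than
// taking the address of a standard library function.
static int upper_char(int c) { return std::toupper(c); }
static int lower_char(int c) { return std::tolower(c); }

// Hypothetical shared helper: applies the converter to every char.
std::string convert_chars(const std::string& str, CharConverter convert)
{
    assert(convert != 0);
    std::string result(str);
    for (std::string::size_type i = 0; i < result.size(); ++i) {
        // Go through unsigned char: passing a negative char value to
        // the <cctype> functions is undefined behaviour.
        result[i] = static_cast<char>(
            convert(static_cast<unsigned char>(result[i])));
    }
    return result;
}

std::string strtoupper(const std::string& str)
{
    return convert_chars(str, upper_char);
}

std::string strtolower(const std::string& str)
{
    return convert_chars(str, lower_char);
}
```

Both functions then share one loop, and any other per-character
conversion only needs a new CharConverter.
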
Medium term:
* fix character codes
* integration of the plump framework

Long term:
* use timer alarm to implement threaded predictor activator
* improve exception handling
* add more predictive plugins

Longer term:
* add gettext support




VARIOUS NOTES
=============

Plugins and Profiles and Managers
---------------------------------

A problem arises when a profile requires that more than one instance
of a Plugin object be created.

profile: pluginA, pluginB, pluginA

plugins: pluginA, pluginB, pluginA

libraries: libpluginA, libpluginB

We need to be able to distinguish (and therefore separately manage)
plugin objects, library objects and profile objects.

libpluginA -+-> pluginA
            |
            `-> pluginA

libpluginB ---> pluginB

ProfileManager should invoke the construction of Plugin objects and
initialise their option values using a PluginFactory class.

PluginManager should manage the association between a Plugin object
and the module (library) object that contains the Plugin.

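The bookkeeping this implies can be sketched as below. Library and
Plugin are plain stand-ins (no module is actually loaded), and the
create(), library_of() and library_count() names are hypothetical. The
point is only that two plugin objects can share one library object.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <map>
#include <string>

// Stand-in for a loaded module; a real one would hold a loader handle.
struct Library {
    std::string name;
    int instances;  // how many plugin objects this library backs
};

// Stand-in for a plugin object created from a library.
struct Plugin {
    std::string name;
};

class PluginManager {
public:
    PluginManager() {}

    // Creates a new plugin object backed by the named library; the
    // library record is created the first time it is needed.
    Plugin* create(const std::string& plugin_name,
                   const std::string& library_name)
    {
        Library& library = libraries_[library_name];
        if (library.name.empty()) {
            library.name = library_name;
            library.instances = 0;
        }
        ++library.instances;

        plugins_.push_back(Plugin());
        plugins_.back().name = plugin_name;
        Plugin* plugin = &plugins_.back();  // std::list keeps addresses stable
        owner_[plugin] = &library;          // std::map keeps addresses stable
        return plugin;
    }

    // Returns the library object that a plugin object came from.
    const Library* library_of(const Plugin* plugin) const
    {
        std::map<const Plugin*, const Library*>::const_iterator it =
            owner_.find(plugin);
        assert(it != owner_.end());
        return it->second;
    }

    std::size_t library_count() const { return libraries_.size(); }

private:
    PluginManager(const PluginManager&);             // not copyable: owner_
    PluginManager& operator=(const PluginManager&);  // points into this object

    std::map<std::string, Library> libraries_;
    std::list<Plugin> plugins_;
    std::map<const Plugin*, const Library*> owner_;
};
```

For the profile "pluginA, pluginB, pluginA" this yields three plugin
objects and two library objects, with both pluginA objects mapped to
the same library.
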
Plump, the Pluggable Lightweight Ubiquitous Multithreaded Platform, was
created to solve this and other problems and is going to become
presage's plugin framework implementation.


Plump framework integration
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The dynamic loading and plugin management system currently implemented
is going to be scrapped in favour of the more general and portable
plump framework.

Plump is a Pluggable Lightweight Ubiquitous Multithreaded Platform
which makes integration, usage and deployment of a plugin framework
dead easy.

Plump integration into presage will require a number of changes to
the presage architecture, affecting the Predictor and PluginManager
classes in particular.

The Predictor and PluginManager classes will delegate much of their
current functionality to plump. Plump will render PluginManager
redundant, as everything that PluginManager does will be done by
plump. Similarly, part of the Predictor class functionality will be
replaced by plump too.

Predictor was intended to execute the plugins in serial or parallel
mode. Plump will do that. Predictor will still be in charge of
collecting the result of each plugin's run and combining them into a
global prediction.

PluginManager was in fact a lesser plump and can be considered a
precursor to it. Plump has been designed to solve the same problems
that PluginManager was intended to solve, plus a bit more.


Plugin creation and initialisation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A few things should happen:
- plugin objects should be instantiated based on configuration files,
  that is, if the configuration file uses the plugin, then an instance
  of the corresponding class implementing the plugin should be
  instantiated

- plugin objects should be initialised with the options contained in
  the configuration file

The most sensible way to achieve these requirements seems to revolve
around having a plugin factory class which:

- determines which and how many instances of plugin classes need to
  be instantiated from the xml configuration file

- passes each plugin a pointer to the root of the xml representation
  of the options specific to that plugin so that the plugin
  constructor can initialise its internal state accordingly

This results in:

- plugins know how to initialise themselves
- the information required for initialisation is passed to the
  plugin's constructor
- the information is passed in xml parse tree format


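The factory described above could be sketched like this. To stay
self-contained it does not parse xml: ConfigNode is a hypothetical
stand-in for one plugin element of the parse tree, and the plugin
classes, option keys and describe() method are invented for the
example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for one plugin element of the parsed xml file.
struct ConfigNode {
    std::string type;                            // which plugin class to build
    std::map<std::string, std::string> options;  // that plugin's own options
};

// Looks up an option, falling back to a default when it is absent.
std::string option_or(const ConfigNode& node, const std::string& key,
                      const std::string& fallback)
{
    std::map<std::string, std::string>::const_iterator it =
        node.options.find(key);
    return it != node.options.end() ? it->second : fallback;
}

class Plugin {
public:
    virtual ~Plugin() {}
    virtual std::string describe() const = 0;
};

// Each plugin initialises itself from the node handed to its constructor.
class DictionaryPlugin : public Plugin {
public:
    explicit DictionaryPlugin(const ConfigNode& node)
        : path_(option_or(node, "dictionary", "")) {}
    virtual std::string describe() const { return "dictionary(" + path_ + ")"; }
private:
    std::string path_;
};

class NgramPlugin : public Plugin {
public:
    explicit NgramPlugin(const ConfigNode& node)
        : database_(option_or(node, "database", "")) {}
    virtual std::string describe() const { return "ngram(" + database_ + ")"; }
private:
    std::string database_;
};

// Generic creator so that registering a plugin class is a one-liner.
template <typename T>
Plugin* create_plugin(const ConfigNode& node) { return new T(node); }

class PluginFactory {
public:
    typedef Plugin* (*Creator)(const ConfigNode&);

    void register_type(const std::string& type, Creator creator)
    {
        assert(creator != 0);
        creators_[type] = creator;
    }

    // Builds one plugin per node, in configuration order; nodes whose
    // type is not registered are skipped.
    std::vector<std::unique_ptr<Plugin> >
    create_all(const std::vector<ConfigNode>& nodes) const
    {
        std::vector<std::unique_ptr<Plugin> > plugins;
        for (std::size_t i = 0; i < nodes.size(); ++i) {
            std::map<std::string, Creator>::const_iterator it =
                creators_.find(nodes[i].type);
            if (it != creators_.end()) {
                plugins.push_back(
                    std::unique_ptr<Plugin>(it->second(nodes[i])));
            }
        }
        return plugins;
    }
private:
    std::map<std::string, Creator> creators_;
};
```

A configuration listing the same type twice produces two independent
objects, each initialised from its own node.
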
Points to ponder:
(o) the plugin factory needs to be able to determine which plugin
    class to instantiate a plugin from, based on the content of the
    configuration file (xml file). A solution could be that the module
    implementing the plugin class exports a string corresponding to the
    plugin type/name.
(o) it is necessary to be able to associate a plugin object with its
    initialisation data. In other words, each plugin class needs to
    have an associated string that describes its kind. Or we can use
    run-time type information.
(o) in light of all this, it is probably worth designing a versioning
    system for plugin classes, to be implemented as exported symbols in
    the plugin module.




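The exported type string and version symbol from the points above
might look as follows. All symbol names and the version number are
hypothetical. In a real module the two functions would live in the
plugin library and be looked up by name through the dynamic loader;
here they sit in the same file so that the check can be run directly.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical symbols a plugin module could export. extern "C" keeps
// the names unmangled so that a loader can find them by name.
extern "C" {
    const char* plugin_type_name() { return "dictionary"; }
    int plugin_interface_version() { return 2; }

    typedef const char* (*NameFn)();
    typedef int (*VersionFn)();
}

// Interface version the host was built against (assumed value).
const int HOST_INTERFACE_VERSION = 2;

// Host-side check: the module is usable only if it speaks the host's
// interface version and provides the plugin type the config asks for.
bool module_is_usable(NameFn name, VersionFn version, const char* wanted_type)
{
    assert(name != 0 && version != 0 && wanted_type != 0);
    if (version() != HOST_INTERFACE_VERSION) {
        return false;
    }
    return std::strcmp(name(), wanted_type) == 0;
}
```

The factory would run such a check before asking the module to create
a plugin, which covers both the type lookup and the versioning point.
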
STEPS to autoconfiscate
~~~~~~~~~~~~~~~~~~~~~~~

  aclocal
  libtoolize --force --ltdl
  autoheader
  autoconf
  automake -a --copy

or source the bootstrap script provided (in svn repo):
  . bootstrap


########/

Copyright (C) 2008 Matteo Vescovi <matteo.vescovi@yahoo.co.uk>

Presage is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License along
with this program; if not, write to the Free Software Foundation, Inc.,
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

########\