

Quechua is a lightweight data mining and machine learning framework, aimed especially at analyzing security threats (although it can be used for non-security analysis too :-) ).

Because Quechua is modular, its components are reusable. This means a module implementing a certain algorithm can be used with many Channels and Processors, as long as the Channel/Processor module wraps data in structures the Algorithm can 'understand' (that is, it knows which methods/functions will be called, and basically includes the declaration of the class type).

Quechua handles 4 types of components:

  • Channel - fetches data from external world,
  • Processor - prepares data for algorithms,
  • Algorithm - does actual work on prepared data,
  • Logger - logs results to the external world
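The four roles can be pictured as abstract interfaces. The sketch below is a minimal illustration of that division of labour — the type names, method names, and signatures are assumptions for this sketch, not Quechua's actual API, and `DataPack` is stubbed:

```cpp
#include <memory>

// Hypothetical stub for the framework's DataPack base class -- only its role
// as a common base type matters here.
struct DataPack {
    virtual ~DataPack() = default;
};

// The four component roles as abstract interfaces (illustrative names):
struct Channel {
    virtual ~Channel() = default;
    virtual std::unique_ptr<DataPack> fetch() = 0;   // read data from the external world
};
struct Processor {
    virtual ~Processor() = default;
    virtual std::unique_ptr<DataPack> process(const DataPack& raw) = 0;  // prepare data
};
struct Algorithm {
    virtual ~Algorithm() = default;
    virtual std::unique_ptr<DataPack> analyze(const DataPack& prepared) = 0;  // do the work
};
struct Logger {
    virtual ~Logger() = default;
    virtual void log(const DataPack& result) = 0;    // write results out
};

// Toy implementations, purely to show how adjacent components connect:
struct IntPack : DataPack {
    int v;
    explicit IntPack(int x) : v(x) {}
};

struct ConstChannel : Channel {
    std::unique_ptr<DataPack> fetch() override { return std::make_unique<IntPack>(21); }
};

struct DoubleProcessor : Processor {
    std::unique_ptr<DataPack> process(const DataPack& raw) override {
        return std::make_unique<IntPack>(static_cast<const IntPack&>(raw).v * 2);
    }
};
```

Note how `DoubleProcessor` only works with channels that produce `IntPack` objects — this is exactly the "adjacent components must understand each other" constraint described below.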

General architecture

As mentioned above, Quechua's architecture is modular, so individual modules can be reused. For example, a preconfigured workflow can be changed by connecting a different channel that reads data from a pcap file instead of a database. The only constraint is that adjacent components must understand each other.

Channel and Logger instances are application-wide. That is, if configured, they are initialized at the very beginning and exist independently, regardless of which workflows exist, if any. Workflows, on the other hand, are instances of a desired piece of work, built from components. A workflow instantiates its Processor and Algorithm components, registers its interest with a certain Channel and Logger and, under specific circumstances, triggers Channels to do their Channel-specific work. The overall architecture looks like:



It is up to you which shared modules (.so files) are loaded, which channels and loggers are instantiated and, most importantly, how your workflows are configured. All of this is done in the configuration file, stored by default at /opt/quechua/etc/quechua.conf.
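To make the three configuration responsibilities concrete, here is a purely illustrative sketch — every section name, key, and module name below is invented for this example, so consult the quechua.conf shipped with your installation for the real syntax (only the `flow` entry is mentioned elsewhere in this page):

```
# Hypothetical quechua.conf layout -- names are invented for illustration.

# 1. Shared modules to load
modules = db_channel.so, normalize.so, kmeans.so, file_logger.so

# 2. Application-wide channel and logger instances
channel db_in
logger  file_out

# 3. A workflow wiring a processor and an algorithm to a channel and a logger
flow example {
    channel   = db_in
    processor = normalize
    algorithm = kmeans
    logger    = file_out
}
```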



Channels

Channels are components that read data and prepare it for workflows. In other words, it is the channel's job to initiate the database connection and handle all connection errors, but also to 'understand' which schema/table should be accessed and read. Other examples are listening on a certain socket, reading a text file, etc. Sometimes it is better to aggregate some data before sending it to the workflows (usually to a processor); such tasks are also the channel's responsibility. By design, it is the channel that initiates the whole flow. That is, only a channel can trigger its registered workflows.


Loggers

Loggers are modules that store results. They work much like Channels, except that they usually write data instead of reading it. Again, they should know where and how the results should be stored.


Workflows

Workflows are the most important structure. You can instantiate a workflow by creating a flow entry in the configuration. The Workflow was designed to let users create 'flows' out of certain components.

Its main principles and tasks are:

  • Instantiate and handle Processor and Algorithm components,
  • Register themselves in certain Channels,
  • Link loggers with a certain workflow, so the results can be logged

In most cases it is the Channel component that notifies all registered workflows about collected data. In the current version one workflow can be registered in only one Channel; however, many workflows can be registered in one Channel (and multiple Channels can co-exist in one configuration).
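This one-channel-to-many-workflows relationship is essentially the observer pattern. A minimal sketch, assuming hypothetical `registerWorkflow`/`trigger` names (the real API may differ):

```cpp
#include <vector>

// Sketch of the channel/workflow relationship as the observer pattern.
struct Workflow {
    int notifications = 0;
    // Called by the channel when it has collected data for this workflow.
    void notify() { ++notifications; }
};

struct Channel {
    std::vector<Workflow*> registered;

    // Many workflows may register their interest in one channel...
    void registerWorkflow(Workflow* w) { registered.push_back(w); }

    // ...but only the channel initiates the flow, notifying everyone registered.
    void trigger() {
        for (Workflow* w : registered) w->notify();
    }
};
```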

For more information about setting up the configuration file, please take a look here

Passing data packs between certain components

Since Quechua itself is a framework, it is almost impossible (and hence very impractical) to design a single universal format for storing the data produced by all modules. That is why not all adjacent components will cooperate: sometimes the processed data should be stored as plain transactions (rows), sometimes as key-value pairs or other structures. Of course, we advise using the most general DataPack subclasses.

If you decide to write your own class definition that stores your data the way you want/need, there is one requirement: you have to subclass the DataPack class. You don't need to define any methods (well, you should define DataPack::initstamp() if you want to use Stamp, see below). Then the receiving component (say, a Processor) just needs to cast the received object (it expects a DataPack object) and then call the methods coming from the subclassed object. The best way to make DataPack subclasses reusable is to store their definitions in the $QUECHUA/modules/data/ directory.
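The subclass-and-cast convention described above can be sketched as follows — `DataPack` is stubbed here, since only its role as a common base matters, and `RowPack`/`countRows` are made-up names for illustration:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Stub standing in for the framework's DataPack base class.
struct DataPack {
    virtual ~DataPack() = default;
};

// Custom container: subclassing DataPack is the only requirement;
// no methods need to be overridden.
struct RowPack : public DataPack {
    std::vector<std::string> rows;
};

// A receiving component (say, a Processor) gets a DataPack* and casts it
// to the subclass both sides agreed on:
std::size_t countRows(DataPack* dp) {
    RowPack* rp = static_cast<RowPack*>(dp);
    return rp->rows.size();
}
```

The cast is safe only because the channel and the processor were configured together and agreed on the concrete type — exactly the "adjacent components must understand each other" constraint.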

If you want to transport information that may be useful for properly handling the whole workflow (the process of certain actions, not the Workflow object type), you may use a Stamp subclass; a shared pointer to it is stored in every DataPack object. Stamp is just a starting point and does literally nothing by itself — it is simply a type to subclass. When you add a Stamp subclass during one component's work, it will be included in every DataPack produced by the other components within that workflow.

Using Stamp in practice

As an example of using Stamp: say your channel reads data from a database, specifically from a table storing tuples with requests and some parameters. You may want to log results together with the operation id those results correspond to. The best and easiest way is to create a Stamp subclass, say:

class OperationStamp : public Stamp {
    int operation;
public:
    OperationStamp(int oid) : operation(oid) {}
    virtual ~OperationStamp() {}
    int getoper() const { return operation; }
};

or just (if you are not afraid of using public variables):

struct OpStamp : public Stamp {
    int oid;
};

which may look better and be handier in some situations :-)