Overview

The gaml library has been supported by the Methodeo project. It is a C++ library, based on generic programming techniques, that offers tools for machine learning: real risk estimators, data manipulation, variable selection, etc. The library itself does not provide regression or classification algorithms; rather, it lets users wrap general-purpose machine learning features around their favorite algorithms. Nevertheless, the well-known libsvm package by Chih-Chung Chang and Chih-Jen Lin is already available in gaml through the gaml-libsvm extension.

Last, let us insist on one major feature of the gaml library. It relies on C++ generic programming, which is strongly typed. The design of the library fits the mathematics of machine learning concepts, and thus the strong typing forces the user to comply with those concepts. This is deliberate. The drawback is clearly that fixing compilation errors can be hard work, since a small typing mistake can generate quite a lot of error messages. The benefit is that all the programming effort is concentrated on that point. Indeed, once it compiles, the code leads to a safe and efficient execution. Very little time is then spent debugging run-time memory errors.

Use of concepts in generic programming

For those who are not familiar with generic programming, the use of concepts may be confusing, since classical object-oriented programming relies rather on inheritance mechanisms. A concept is a syntactical requirement. In the gaml library, such requirements are documented through fake classes in the gaml::concepts namespace. Let us take the example of the gaml::concepts::Predictor concept.

The gaml::concepts::Predictor concept says that a predictor must define two types, named input_type and output_type, and that it should provide default and copy constructors, as well as an operator() method. Let us propose a (dummy) predictor.

#include <string>

class Funny {
public:
  typedef char         input_type;
  typedef std::string  output_type;
 
  Funny(void) {}
  Funny(const Funny& other) {}
  Funny& operator=(const Funny& other) {return *this;}
 
  // Maps a character x to the string made of ten copies of x.
  output_type operator()(const input_type& x) const {
    return std::string(10,x); // "xxxxxxxxxx"
  }
};
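
Since a concept is only a syntactical requirement, it can be checked at compile time. The following sketch is not part of gaml: the looks_like_predictor helper is a hypothetical name introduced here for illustration. It only tests the requirements listed above (the two nested types, default and copy construction, and a call operator whose result converts to output_type) with standard C++11 type traits.

```cpp
#include <string>
#include <type_traits>
#include <utility>

// Same dummy predictor as above, repeated so that the sketch is self-contained.
class Funny {
public:
  typedef char        input_type;
  typedef std::string output_type;

  Funny() {}
  Funny(const Funny&) {}
  Funny& operator=(const Funny&) { return *this; }

  output_type operator()(const input_type& x) const {
    return std::string(10, x);
  }
};

// Hypothetical helper (NOT a gaml function): true when Predictor is default
// and copy constructible, and when calling a const Predictor on a
// const input_type& yields something convertible to output_type.
// The nested typedefs also force input_type and output_type to exist.
template<typename Predictor>
constexpr bool looks_like_predictor() {
  typedef typename Predictor::input_type  In;
  typedef typename Predictor::output_type Out;
  return std::is_default_constructible<Predictor>::value
      && std::is_copy_constructible<Predictor>::value
      && std::is_convertible<
           decltype(std::declval<const Predictor&>()(std::declval<const In&>())),
           Out>::value;
}

// Checked by the compiler, no inheritance involved.
static_assert(looks_like_predictor<Funny>(),
              "Funny should fit the Predictor requirements");
```

If a requirement is missing, for instance if output_type is not defined, the compilation fails before any code is run.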

This Funny class fits the gaml::concepts::Predictor concept, although no inheritance is involved. When an algorithm requires an argument whose type fits the gaml::concepts::Predictor concept, this is specified in the documentation. For example, let us suppose that the function foo is dedicated to the manipulation of some predictor. Its declaration in the gaml library would be

namespace gaml {
  template<typename Predictor>
  double foo(const Predictor& pred) {....}
}

The use of the function in some code where Funny is available would be

Funny funny;

double result = gaml::foo<Funny>(funny);

This will compile fine as long as the Funny class fits the gaml::concepts::Predictor concept. Moreover, when the compiler can deduce the template parameter types from the function call, they can be omitted. This leads to the following code, which gives you the flavor of gaml function calls.

Funny funny;

double result = gaml::foo(funny);
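
To try both call styles without gaml installed, here is a minimal self-contained sketch. The sketch namespace and the body of foo are assumptions made for illustration only: the real gaml::foo is left unspecified above, so this stand-in arbitrarily returns the length of the prediction obtained for a default-constructed input.

```cpp
#include <cassert>
#include <string>

// Same dummy predictor as above, repeated so that the sketch is self-contained.
class Funny {
public:
  typedef char        input_type;
  typedef std::string output_type;

  Funny() {}
  Funny(const Funny&) {}

  output_type operator()(const input_type& x) const {
    return std::string(10, x);
  }
};

namespace sketch {  // hypothetical namespace standing in for gaml
  // Stand-in for foo: its body is arbitrary, only its signature matters here.
  template<typename Predictor>
  double foo(const Predictor& pred) {
    typedef typename Predictor::input_type input_type;
    return static_cast<double>(pred(input_type()).size());
  }
}

// Calls foo with and without an explicit template argument.
inline double deduction_demo() {
  Funny funny;
  double explicit_call = sketch::foo<Funny>(funny);  // template argument given
  double deduced_call  = sketch::foo(funny);         // template argument deduced
  assert(explicit_call == deduced_call);             // same instantiation, same result
  return deduced_call;
}
```

Both calls instantiate the same function, so omitting the template argument is purely a matter of readability.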

Data sets, iterators and extraction functions

The idea of the library is that data belong to collections that can be accessed through iterators. Most algorithms provided in the gaml library take iterators as arguments when they have to consider a collection of data. This is compliant with the STL programming style. Users are thus responsible for the way they store the data. Consequently, they have to provide functions that retrieve the elements of each single datum in the data set. Typically, data sets contain input/output pairs. The gaml algorithm is provided with iterators on the data set and accesses successive elements. From each element, it has to extract the input and the output contained in the pair. In order not to impose a coding of those pairs on the user, gaml algorithms have to be given two supplementary extraction functions. Let us write some typical gaml code accordingly.

#include <string>
#include <utility>
#include <vector>

typedef char                    Input;
typedef std::string             Output;
typedef std::pair<Input,Output> Data;
typedef std::vector<Data>       Samples;
 
const Input&  input_of (const Data& data) {return data.first;}
const Output& output_of(const Data& data) {return data.second;}
 
int main(int argc, char* argv[]) {
  Samples basis;
 
  // Let us fill the basis.
  basis.resize(100);
  for(Samples::iterator iter = basis.begin(); iter != basis.end(); ++iter) {
    Data& data  =  *iter;
    data.first  = // init some input here
    data.second = // init some output here
  }
 
  // Let us set up a shuffled basis.
  gaml::Shuffle<Samples::iterator,nasty-functional-types> shuffled = gaml::shuffle(basis.begin(), basis.end());
 
  // Let us compute something
  Funny funny;
  double risk = gaml::some_algo(funny,
                                shuffled.begin(), shuffled.end(), // We iterate on the shuffled basis.
                                input_of,     // These are the
                                output_of);   // extraction functions.
}

The previous code benefits from implicit template parameter resolution: gaml::some_algo is a template function whose type parameters can be omitted, as mentioned for gaml::foo previously. The code can be simplified further. First, C++11 provides a concise notation for iterating over collections (the range-based for loop). Second, the auto keyword can be used where a type name is required, when the type can be deduced by the compiler. This is the case for the obscure gaml::Shuffle<Samples::iterator,nasty-functional-types> type provided by gaml. The code can thus be rewritten as follows.

#include <string>
#include <utility>
#include <vector>

typedef char                    Input;
typedef std::string             Output;
typedef std::pair<Input,Output> Data;
typedef std::vector<Data>       Samples;
 
const Input&  input_of (const Data& data) {return data.first;}
const Output& output_of(const Data& data) {return data.second;}
 
int main(int argc, char* argv[]) {
  Samples basis;
 
  // Let us fill the basis.
  basis.resize(100);
  for(auto& data : basis) {
    data.first  = // init some input here
    data.second = // init some output here
  }
 
  // Let us set up a shuffled basis.
  auto shuffled = gaml::shuffle(basis.begin(), basis.end());
 
  // Let us compute something
  Funny funny;
  auto risk = gaml::some_algo(funny,
                              shuffled.begin(), shuffled.end(),
                              input_of,
                              output_of);
}
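
The benefit of auto for adapter types can be tried independently of gaml. In the sketch below, shuffled_refs is a hypothetical helper written for this illustration (it is not gaml::shuffle and makes no claim about how gaml implements shuffling): it collects references to the elements of a range and shuffles the references only, leaving the original collection untouched. Its return type is tedious to spell out, which is exactly what auto hides at the call site.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <iterator>
#include <random>
#include <vector>

namespace sketch {  // hypothetical namespace, not gaml
  // Returns references to the elements of [begin, end) in a random order.
  // Only the references are permuted, the underlying data is not moved.
  template<typename Iterator, typename Rng>
  std::vector<std::reference_wrapper<
      typename std::iterator_traits<Iterator>::value_type>>
  shuffled_refs(Iterator begin, Iterator end, Rng& rng) {
    std::vector<std::reference_wrapper<
        typename std::iterator_traits<Iterator>::value_type>> refs(begin, end);
    std::shuffle(refs.begin(), refs.end(), rng);
    return refs;
  }
}

// Sums the elements through the shuffled view and returns the sum.
inline int shuffle_demo() {
  std::vector<int> basis = {1, 2, 3, 4, 5};
  std::mt19937 rng(42);  // fixed seed, for reproducibility only

  // auto stands for std::vector<std::reference_wrapper<int>>.
  auto shuffled = sketch::shuffled_refs(basis.begin(), basis.end(), rng);

  int sum = 0;
  for (int value : shuffled) sum += value;  // each reference converts to int

  assert(shuffled.size() == basis.size());
  assert(basis[0] == 1 && basis[4] == 5);   // the original order is untouched
  return sum;
}
```

Whatever the order produced by the generator, every element is visited exactly once, so the sum is the same as for the original collection.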

Moreover, C++11 provides a syntax for defining functions on the fly in the code (lambda functions). This can be done for input_of and output_of, and the code can then be rewritten as follows.

#include <string>
#include <utility>
#include <vector>

typedef char                    Input;
typedef std::string             Output;
typedef std::pair<Input,Output> Data;
typedef std::vector<Data>       Samples;
 
int main(int argc, char* argv[]) {
  Samples basis;
 
  // Let us fill the basis.
  basis.resize(100);
  for(auto& data : basis) {
    data.first  = // init some input here
    data.second = // init some output here
  }
 
  // Let us set up a shuffled basis.
  auto shuffled = gaml::shuffle(basis.begin(), basis.end());
 
  // Let us compute something
  Funny funny;
  auto risk = gaml::some_algo(funny,
                              shuffled.begin(), shuffled.end(),
                              [](const Data& data) -> const Input&  {return data.first;},
                              [](const Data& data) -> const Output& {return data.second;});
}
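
The listings above rely on gaml and contain placeholders, so they cannot be compiled as they stand. As a complement, here is a minimal self-contained sketch of the same calling pattern (a predictor, a pair of iterators and two extraction functions). The empirical_risk function and the sketch namespace are hypothetical names introduced for this illustration: they do not belong to gaml, and the 0/1 loss used here is only one arbitrary choice of risk.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

typedef char                     Input;
typedef std::string              Output;
typedef std::pair<Input, Output> Data;
typedef std::vector<Data>        Samples;

// Same dummy predictor as above, repeated so that the sketch is self-contained.
class Funny {
public:
  typedef char        input_type;
  typedef std::string output_type;

  Funny() {}
  Funny(const Funny&) {}

  output_type operator()(const input_type& x) const {
    return std::string(10, x);
  }
};

namespace sketch {  // hypothetical namespace, not gaml
  // Fraction of the samples in [begin, end) whose prediction differs from
  // the stored output. The algorithm never looks inside a sample itself:
  // it only goes through input_of and output_of.
  template<typename Predictor, typename Iterator,
           typename InputOf, typename OutputOf>
  double empirical_risk(const Predictor& pred,
                        Iterator begin, Iterator end,
                        const InputOf& input_of, const OutputOf& output_of) {
    double      errors = 0;
    std::size_t count  = 0;
    for (Iterator it = begin; it != end; ++it, ++count)
      if (!(pred(input_of(*it)) == output_of(*it))) errors += 1;
    return count == 0 ? 0.0 : errors / count;
  }
}

// Builds a tiny basis and evaluates Funny on it.
inline double risk_demo() {
  Samples basis;
  basis.push_back(Data('a', std::string(10, 'a')));      // well predicted
  basis.push_back(Data('b', std::string(10, 'b')));      // well predicted
  basis.push_back(Data('c', std::string("ccc")));        // mispredicted
  basis.push_back(Data('d', std::string("something")));  // mispredicted

  Funny funny;
  double risk = sketch::empirical_risk(
      funny, basis.begin(), basis.end(),
      [](const Data& data) -> const Input&  { return data.first;  },
      [](const Data& data) -> const Output& { return data.second; });

  assert(risk == 0.5);  // two of the four samples are mispredicted
  return risk;
}
```

Storing the samples differently (a struct, a tuple, a database row) would only change the two lambdas, not the algorithm.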

The documentation

The user manual of the gaml library consists of a set of examples, available in this documentation. They are ordered, and they should be read in that order to get a comprehensive overview of the gaml features.

Conclusion

The use of gaml implies invoking templates that can be intricate. Thanks to C++11 syntactical elements (such as auto), this intricacy can be hidden from the user so that the code remains readable. The code naturally expresses the methodological concepts of machine learning, and type checking ensures that they are not misused. Since all the type checking effort is made at compile time, the executable, once compiled, is safe and efficient.