## Big Learning Package

Big Learning Package is a minimization package for Big Data machine learning problems (and beyond), based on state-of-the-art convex minimization algorithms, mainly randomized coordinate methods. The package aims to be an easy-to-use, robust, and versatile implementation. It is built on the fast Eigen library and can be deployed on clusters and shared-memory systems. The implementation so far contains:

- Full Gradient method (FG)
- Coordinate Descent method (CD)
- Parallel Coordinate Descent Method (PCDM) with nice samplings, based on the paper: http://arxiv.org/abs/1212.0873

The library currently implements parallelization only with OpenMP; MPI support is likely to appear in time.

In terms of data input formats, we currently support only the popular big-data numerical format HDF5 (copyright The HDF Group, http://www.hdfgroup.org/HDF5/). We plan to add simple CSV input in the near future (although if your data fits into a CSV file without significantly straining your disk space, it should probably not be considered big data in the first place).

## Requirements

The implementation runs on all common platforms. It is written in C++ and requires the following libraries:

Library | Version | Licence
---|---|---
Eigen | 3.+ | GNU/GPL
Vigra | 1.8+ | Unknown
libhdf5 | 2.0+ | Free for non-commercial use
CMake | 2.8+ | GNU/GPL

The build is performed via CMake (Cross-Platform Make). Please use the newest version so that you do not have to enter the paths of the various libraries manually. If you have to do so anyway, I recommend using cmake-gui or ccmake.

This software has been so far tested only on Linux Machines, but should run on other major platforms as well.

## How to install

- First, clone the repository with git: `git clone https://Mojusko@bitbucket.org/Mojusko/biglearning.git`
- The previous command downloads the package. You can now build the code you want to run with the specified algorithms. If you are a more experienced programmer, you can simply include the headers and source files in your own program and use the package as a library.
- Create a build directory (if it does not exist: `mkdir build`)
- `cd build`
- Run `cmake ..`
- `make`
- Run `./biglearning` to start the test application

## Customization

The main contribution of this implementation is a static library (learning.a) that you can compile and link with your own project. If you look into our very simple CMakeLists.txt, you can immediately see how to include our library in your own project.
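
As a sketch of how linking might look from a consuming project (the target name, paths, and source file here are hypothetical; adapt them to your own tree and to the package's actual CMakeLists.txt):

```cmake
# Hypothetical example: paths assume biglearning was cloned and built
# in a sibling directory named "biglearning".
cmake_minimum_required(VERSION 2.8)
project(my_solver)

include_directories(${CMAKE_SOURCE_DIR}/biglearning)   # package headers
add_executable(my_solver main.cpp)
# Link against the prebuilt static library
target_link_libraries(my_solver ${CMAKE_SOURCE_DIR}/biglearning/build/learning.a)
```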

## Theoretical Background

The theoretical background behind this implementation can be found in the docs directory of this repository.

There we also present a graph showing the relative minimization power of PCDM as the number of cores increases.

## Tutorial & Example

In the main directory you can find main.cpp, which shows how the functions from Big Learning Package should be used. The whole process splits into four stages: loading, passing, initializing, and minimizing.

```cpp
//--------- Loading Data
Loader_HDF5 Loader;                          // Creating Loader Module
Loader.Set_Access_Mode("RAM");               // Access mode: data are handled in RAM
Loader.Load("../LeSqr_Data_n_10_m_10.h5");   // We specify the data we want
```

As a side note, the file LeSqr_Data_n_10_m_10.h5 contains the following datasets: data, a 10x10 matrix, and labels, a 10x1 matrix. One can check this with HDFView (a very useful tool).

```cpp
//--------- Passing of Data & Creation of Model
// We create two models:
Logistic_Regression F;
Least_Squares_Regression F2;
// Pass data to the function: specify the loader, the names of the
// datasets, and their sizes n=10, m=10
F2.Pass_Data_HDF5(&Loader, "data", "labels", 10, 10);
```

```cpp
//--------- Create Minimization Models
Full_Gradient Min1(&F2);
Random_Coordinate_Descend Min2(&F2);
Parallel_Random_Coordinate_Descend Min3(&F2);
```

```cpp
//-------- Minimize
double Time = 10;
// Terminates after Time seconds; the first argument is the number of cores running
Time_Stop Criteria(1, Time, clock());
// Minimize the function with step sizes based on Lipschitz constants and a
// random starting point, using the Full Gradient method
Min1.Optimize("Lipschitz", &Criteria, "Random");
cout << Min1.Get_Min().transpose() << endl;
// First argument corresponds to the function minimum, second to the desired error
Error_Stop Criteria2(0, 0.1);
// Minimize with Lipschitz step sizes, a random starting point, and PCDM
// with number of cores = 2
Min3.Optimize(2, "Lipschitz", &Criteria2, "Random");
cout << Min3.Get_Min().transpose() << endl;
```

## To come

- Power Method for Large Matrices to determine approx. Lipschitz constant
- HYDRA algorithm
- APPROX algorithm
- Heuristic step-size selection
- Custom step-sizes
- Custom stopping criteria creator
- Passive RAM option
- CSV file input

## Licence and no Warranty

This is free software under the GPL licence. This software comes with absolutely no warranty.