DataFrame Implementation
Overview
The dataframe_t is ML-in-C's core datatype, which is designed to drive most machine learning operations. It uses a row-major flattened array to store data, but can be represented as a 2D array or a matrix by utilizing its n_rows and n_cols struct properties. The initial implementation features minimal suite of helper functions to aid the user in data manipulation tasks. However, one glaring limitation is that its data is fixed to the double datatype, meaning string operations are not supported as of writing. Moreover, missing values or NaNs are not handled as of yet, which could result in a corrupted dataframe if a dataset with mismatched rows/columns or missing data is used.
Major Design Choices
Data Manipulation
While the library supports essential row/column operations such as appending, dropping, slicing, sorting, and masking, they are made intentionally lightweight. The primary focus of the library is the machine learning workflow, hence, extensive data manipulation features were not a priority in the initial implementation.
Memory Layout
While the dataframe_t represents a matrix, it stores its data as a row-major one-dimensional array, taking inspiration from numpy's ndarray design. While not ideal for data manipulation tasks, it provides benefits in machine learning workflows; this is in line with model training, which often iterates through rows multiple times.
Data access is done by strides, utilizing the dataframe_t's n_rows and n_cols properties. Accessing a column in a row can be done by using the following formula:
Where:
- is the target row index;
- is the target column index;
- and is the total number of columns.
Although sub-optimal for columnar operations, this contiguous row-major format groups entire rows together as blocks. This allows for efficient memcpy operations when accessing/cloning rows, avoiding the overhead of iterating column by column.
Byte Masking
A separate struct called df_mask_t was made, bundled with its helper functions to perform logical AND and OR operations on the filtering conditions. Under the hood, it stores a corresponding uint8_t for each row index. The uint8_t datatype was chosen for memory efficiency as it is guaranteed to occupy only 1 byte of memory, unlike <stdbool.h>'s bool implementation-specific size variations.
Unlike pandas' boolean mask, the df_mask_t performs byte masking, and is set to two possible values: 0x00 and 0xFF. Initially, all values of the mask is set to 0xFF. The user may modify this by using the logical AND and OR helper functions to apply conditions accordingly.
Byte masking was preferred over boolean masking because it allows bitwise operations, provides memory predictability, and more options for future optimizations.
Sorting Algorithm
An iterative variant of the Lomuto Quick Sort was chosen as the sorting algorithm. While iteration makes it slightly harder to understand compared to recursion, it helps with predictability, performance, and avoidance of errors.
Lomuto partitioning partitions the array and picks the last element as the pivot. While slower than other partitioning algorithms, it is arguably the simplest. Sorting is not expected to be frequently used as this library is focused on the machine learning workflow, which is why a simple algorithm was preferred despite its tradeoffs. Moreover, optimization for the sorting functionality is planned but not prioritized for future updates.
Quick Sort was the algorithm of choice due to its in-place design, meaning it does not require an additional array for sorting. It uses extra memory and time complexity on average.
Datasets may get large in practice, which is why the memory-efficient Quick Sort was preferred over the stable Merge Sort.
One noticeable limitation in the current implementation is the lack of tie-breakers. In other words, the dataframe may only be sorted by one key at a time, unlike SQL's ORDER BY keyword, which allows for multiple key sorting. Again, sorting is not expected to be frequently used, hence, this feature is not prioritized.