Document Database SDK 📄

The VecML Document Database SDK provides high-performance, scalable document storage and retrieval for applications that need efficient document indexing, querying, and storage management. You can use it to build fast, memory-efficient document storage and retrieval systems. 🚀

🛠 Creating a Fluffy Document Interface

To begin using the Fluffy Document Database, initialize an instance of FluffyDocumentInterface. This class manages document storage, indexing, and retrieval.

Initialization

#include "fluffy_document_interface.h"

std::string database_path = "path/to/database";
std::string license_path = "license.txt";

// Create a document collection instance
fluffy::FluffyDocumentInterface docInterface(database_path, license_path);

database_path: Directory where document data and indices are stored.
license_path: Path to a valid license file.

About License

Please contact sales@vecml.com to obtain a valid license.txt. The license file is required to initialize and use the VecML SDK. Without a valid license, functionalities are restricted or unavailable. Ensure that the license.txt file is placed in the correct directory and accessible by your application to avoid initialization errors.
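Because a missing or unreadable license file leads to initialization errors, it can help to check the file before constructing the interface. The following is a minimal sketch using only the C++17 standard library. The helper name license_file_is_readable is hypothetical and not part of the SDK, and it only confirms that the file can be opened, not that the license is valid.

```cpp
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

// Hypothetical helper (not part of the SDK).
// Returns true if `license_path` names an existing regular file that can be
// opened for reading. This checks accessibility only, not license validity.
bool license_file_is_readable(const std::string& license_path) {
    namespace fs = std::filesystem;
    std::error_code ec;
    if (!fs::is_regular_file(license_path, ec) || ec) {
        return false;  // missing, not a regular file, or status lookup failed
    }
    std::ifstream in(license_path);
    return in.good();  // false if the file exists but cannot be opened
}
```

An application could call license_file_is_readable(license_path) and report a clear error message before passing the path to FluffyDocumentInterface.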

📥 Adding Documents

You can add a document and its attributes to the database using the add_document() method. Each document is identified by a unique string ID.

std::string document_id = "unique_id";               // document IDs are strings
std::string text = "This is an example document.";
std::string path = "path/to/the/file";               // you can specify the file path if needed

std::string title = "document_title";
std::string category = "news";
std::unordered_map<std::string, std::string> attributes = {{"title", title}, {"category", category}};

// Insert the document into the collection
fluffy::ErrorCode status = docInterface.add_document(document_id, text, path, attributes);

if (status == fluffy::ErrorCode::Success) {
    std::cout << "Document added successfully!" << std::endl;
} else {
    std::cerr << "Failed to add document. Error code: " << static_cast<int>(status) << std::endl;
}
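When many documents are added in one pass, it is useful to keep going after a failure and report which IDs failed. The sketch below shows one way to do that. PendingDocument and add_all are hypothetical names introduced for illustration, not SDK types, and the actual insert is passed in as a callable so the helper makes no assumptions about the SDK beyond the add_document() call shown above.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper type (not part of the SDK): one document waiting to be added.
struct PendingDocument {
    std::string id;
    std::string text;
    std::string path;
    std::unordered_map<std::string, std::string> attributes;
};

// Calls `add_one` for every document and returns the IDs for which it
// reported failure. `add_one` is any callable that takes a
// const PendingDocument& and returns true on success.
template <typename AddFn>
std::vector<std::string> add_all(const std::vector<PendingDocument>& docs, AddFn add_one) {
    std::vector<std::string> failed_ids;
    for (const PendingDocument& doc : docs) {
        if (!add_one(doc)) {
            failed_ids.push_back(doc.id);
        }
    }
    return failed_ids;
}
```

With the interface from this guide, the callable could be a lambda that returns docInterface.add_document(d.id, d.text, d.path, d.attributes) == fluffy::ErrorCode::Success.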

🔍 Searching Documents

Full-text Search

To perform a full-text search, use search_documents(), which returns the documents relevant to the provided query.

std::string query = "example";
fluffy::InterfaceDocumentQueryResults results;

// Execute search
int top_k = 10;
fluffy::ErrorCode status = docInterface.search_documents(query, top_k, results);

if (status == fluffy::ErrorCode::Success) {
    std::cout << "Full-text search with query \"" << query << "\":" << std::endl;
    for (auto& doc : results.results) {
        std::string doc_id = doc.document_id;
        std::string raw_text;
        docInterface.get_text(doc_id, raw_text);

        std::cout << doc_id << "  " << raw_text << std::endl;
    }
} else {
    std::cerr << "Search failed. Error code: " << static_cast<int>(status) << std::endl;
}

get_text() can be used to retrieve the full text of a document specified by its ID.

Results Structure (InterfaceDocumentQueryResults)

A list of matching document string IDs, sorted from most to least similar according to fuzzy full-text keyword matching.
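To make the "sorted from high to low, limited to top_k" behavior concrete, here is a conceptual sketch. It does not reproduce the SDK's internals: ScoredHit, its score field, and top_k_by_score are hypothetical stand-ins, since this guide only shows that each result exposes document_id.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one search hit. The `score` field is an
// assumption for illustration; this guide only documents `document_id`.
struct ScoredHit {
    std::string document_id;
    double score;
};

// Returns at most `top_k` hits, ordered from highest to lowest score.
std::vector<ScoredHit> top_k_by_score(std::vector<ScoredHit> hits, std::size_t top_k) {
    const std::size_t keep = std::min(top_k, hits.size());
    std::partial_sort(hits.begin(),
                      hits.begin() + static_cast<std::ptrdiff_t>(keep),
                      hits.end(),
                      [](const ScoredHit& a, const ScoredHit& b) { return a.score > b.score; });
    hits.resize(keep);  // drop everything past the first `keep` hits
    return hits;
}
```

The point of the sketch is the contract a caller can rely on: the first entry is the best match, and the list never contains more than top_k entries.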

Attribute Search (e.g., by Title, Author)

To conduct keyword search on a specific attribute (for example, by document title or author), follow the two steps below:

Create an index for the attribute using attach_attribute_index():

docInterface.attach_attribute_index("title");        // create an index for attribute "title"

Once the index is attached to the document interface, every document added afterwards is added to the index automatically.

Use the search_attribute() function to perform a keyword search on an attribute:

    std::string title_keyword = "Animali";
    fluffy::InterfaceDocumentQueryResults attr_results;
    int top_k = 10;
    docInterface.search_attribute(title_keyword, "title", top_k, attr_results);

    std::cout << "Title search with keyword " << title_keyword << ":" << std::endl;
    for (auto& doc : attr_results.results) {
        std::string doc_id = doc.document_id;
        std::string title;
        docInterface.get_document_attribute(doc_id, "title", title);

        std::cout << doc_id << "  " << title << std::endl;
    }

get_document_attribute() can be used to retrieve an attribute from a document specified by its ID.

Results Structure (InterfaceDocumentQueryResults)

A list of matching document string IDs, sorted from most to least similar according to fuzzy attribute keyword matching.
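An application may want to show one combined list from a full-text search and an attribute search. The sketch below assumes the document IDs have already been copied out of each result set into vectors. merge_unique_ids is a hypothetical helper, not an SDK function. It keeps the first list's order, then appends IDs that appear only in the second.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not part of the SDK).
// Concatenates two ID lists, keeping only the first occurrence of each ID
// and preserving the original relative order.
std::vector<std::string> merge_unique_ids(const std::vector<std::string>& first,
                                          const std::vector<std::string>& second) {
    std::vector<std::string> merged;
    std::unordered_set<std::string> seen;
    auto append_new = [&](const std::vector<std::string>& ids) {
        for (const std::string& id : ids) {
            if (seen.insert(id).second) {  // true only the first time an ID is seen
                merged.push_back(id);
            }
        }
    };
    append_new(first);
    append_new(second);
    return merged;
}
```

Note that this simple merge does not re-rank: an ID found by both searches keeps the position it had in the first list.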

🗑 Removing a Document

To remove a document, call remove_document().

std::string document_id = "id_to_remove";

// Remove the document
fluffy::ErrorCode status = docInterface.remove_document(document_id);

if (status == fluffy::ErrorCode::Success) {
    std::cout << "Document removed successfully!" << std::endl;
} else {
    std::cerr << "Failed to remove document. Error code: " << static_cast<int>(status) << std::endl;
}

💾 Managing Storage and Performance

Flushing Data to Disk

To ensure that all changes are persisted to disk, call flush():

fluffy::ErrorCode status = docInterface.flush();
if (status == fluffy::ErrorCode::Success) {
    std::cout << "Data successfully flushed to disk." << std::endl;
}

Changes made since the last flush() may be lost if the system crashes, so call flush() after important updates.
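One simple way to flush regularly without flushing on every write is to count writes and flush every N of them. The sketch below shows that policy in isolation. FlushCounter is a hypothetical helper, not part of the SDK, and the right value of N depends on how much unflushed work the application can afford to lose.

```cpp
#include <cstddef>

// Hypothetical helper (not part of the SDK): signals when enough writes
// have accumulated that the caller should flush.
class FlushCounter {
public:
    // A value of 0 is treated as 1, i.e. flush after every write.
    explicit FlushCounter(std::size_t writes_per_flush)
        : writes_per_flush_(writes_per_flush == 0 ? 1 : writes_per_flush) {}

    // Records one write. Returns true when the caller should flush now,
    // and resets the pending count in that case.
    bool record_write() {
        if (++pending_ >= writes_per_flush_) {
            pending_ = 0;
            return true;
        }
        return false;
    }

    // Number of writes recorded since the last flush signal.
    std::size_t pending() const { return pending_; }

private:
    std::size_t writes_per_flush_;
    std::size_t pending_ = 0;
};
```

In an insert loop, the application would call record_write() after each successful add_document() and call docInterface.flush() whenever it returns true, plus one final flush() at the end if pending() is nonzero.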

Optimizing Memory Usage

To reduce memory usage, use offload(), which moves unused data from memory to disk.

fluffy::ErrorCode status = docInterface.offload();
if (status == fluffy::ErrorCode::Success) {
    std::cout << "Memory offloaded successfully." << std::endl;
}

Key Notes:

The system will automatically reload necessary data when required.
Calling offload() too frequently may impact performance.
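Since offloading too often can hurt performance, an application might enforce a minimum interval between offload() calls. The following sketch shows one such guard built on std::chrono. OffloadThrottle is a hypothetical helper, not part of the SDK, and a suitable interval is workload-dependent.

```cpp
#include <chrono>
#include <optional>

// Hypothetical helper (not part of the SDK): allows an action at most once
// per `min_interval`.
class OffloadThrottle {
public:
    using Clock = std::chrono::steady_clock;

    explicit OffloadThrottle(Clock::duration min_interval)
        : min_interval_(min_interval) {}

    // Returns true if the action is allowed at time `now` and records that
    // time. Returns false if the previous allowed action was too recent.
    bool try_acquire(Clock::time_point now = Clock::now()) {
        if (last_ && now - *last_ < min_interval_) {
            return false;
        }
        last_ = now;
        return true;
    }

private:
    Clock::duration min_interval_;
    std::optional<Clock::time_point> last_;  // empty until the first allowed action
};
```

A caller would wrap the offload in a check such as if (throttle.try_acquire()) { docInterface.offload(); }. The explicit time parameter exists mainly so the policy can be tested without waiting.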

Fine-Grained Offloading

If you need more control over offloading, use:

offload_data(): Moves document text data to disk.
offload_index(): Moves indexing structures to disk.

docInterface.offload_data();  // Offload document text data
docInterface.offload_index(); // Offload indexing structures

Use these selectively when memory is constrained.

🚀 Best Practices

✅ Use flush() regularly to persist changes.
✅ Use offload() after indexing and querying to free memory.
✅ Batch insert documents to improve performance.
✅ Ensure unique document IDs to avoid overwriting data.
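
Because reusing a document ID can overwrite existing data, it can be worth checking a batch of IDs for duplicates before inserting. The sketch below is one way to do that with the standard library. find_duplicate_ids is a hypothetical helper, not an SDK function, and it only checks IDs within the batch, not against documents already stored.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper (not part of the SDK).
// Returns every ID that occurs more than once in `ids`. Each duplicate is
// reported once, in the order in which its first repetition appears.
std::vector<std::string> find_duplicate_ids(const std::vector<std::string>& ids) {
    std::unordered_set<std::string> seen;
    std::unordered_set<std::string> reported;
    std::vector<std::string> duplicates;
    for (const std::string& id : ids) {
        const bool is_repeat = !seen.insert(id).second;
        if (is_repeat && reported.insert(id).second) {
            duplicates.push_back(id);
        }
    }
    return duplicates;
}
```

If the returned vector is non-empty, the application can rename or skip the affected documents before calling add_document().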