Document Database SDK 📄

The VecML Document Database SDK is a high-performance document storage and retrieval system for applications that need efficient document indexing, querying, and storage management. Use it as the foundation for fast, scalable, and memory-efficient document stores. 🚀


🛠 Creating a Fluffy Document Interface

To begin using the Fluffy Document Database, initialize an instance of FluffyDocumentInterface. This class manages document storage, indexing, and retrieval.

Initialization

#include "fluffy_document_interface.h"

std::string database_path = "path/to/database";
std::string license_path = "license.txt";

// Create a document collection instance
fluffy::FluffyDocumentInterface docInterface(database_path, license_path);
  • database_path: Directory where document data and indices are stored.
  • license_path: Path to a valid license file.

About License

Please contact sales@vecml.com to obtain a valid license.txt. The license file is required to initialize and use the VecML SDK. Without a valid license, functionalities are restricted or unavailable. Ensure that the license.txt file is placed in the correct directory and accessible by your application to avoid initialization errors.


📥 Adding Documents

You can add a document and its attributes to the database with the add_document() method. Each document is identified by a unique string ID.

std::string document_id = "unique_id";               // document IDs are strings
std::string text = "This is an example document.";
std::string path = "path/to/the/file";               // you can specify the file path if needed

std::string title = "document_title";
std::string category = "news";
std::unordered_map<std::string, std::string> attributes = {{"title", title}, {"category", category}};

// Insert the document into the collection
fluffy::ErrorCode status = docInterface.add_document(document_id, text, path, attributes);

if (status == fluffy::ErrorCode::Success) {
    std::cout << "Document added successfully!" << std::endl;
} else {
    std::cerr << "Failed to add document. Error code: " << static_cast<int>(status) << std::endl;
}
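
When many documents need to be added, the same call can be wrapped in a loop that records which inserts failed. The following is a minimal sketch, not part of the SDK: PendingDocument, add_documents(), DemoStatus, and DemoStore are hypothetical names introduced here. The helper is a template so that it works with any object exposing add_document() as shown above, and it assumes the method accepts its arguments by value or const reference. DemoStore is only a stand-in so the sketch compiles without the SDK headers.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

using Attributes = std::unordered_map<std::string, std::string>;

// One document waiting to be inserted; fields mirror the add_document() arguments.
struct PendingDocument {
    std::string id;
    std::string text;
    std::string path;
    Attributes attributes;
};

// Adds every document in order and returns the IDs whose insert did not report Success.
// Db is any type with add_document(id, text, path, attributes) returning a scoped enum
// that has a Success value (fluffy::ErrorCode in the examples above).
template <typename Db>
std::vector<std::string> add_documents(Db& db, const std::vector<PendingDocument>& docs) {
    std::vector<std::string> failed;
    for (const PendingDocument& doc : docs) {
        auto status = db.add_document(doc.id, doc.text, doc.path, doc.attributes);
        if (status != decltype(status)::Success) {
            failed.push_back(doc.id);
        }
    }
    return failed;
}

// Hypothetical stand-in for the SDK interface: stores text in memory, rejects empty IDs.
enum class DemoStatus { Success, Failure };

struct DemoStore {
    std::unordered_map<std::string, std::string> texts;

    DemoStatus add_document(const std::string& id, const std::string& text,
                            const std::string& /*path*/, const Attributes& /*attributes*/) {
        if (id.empty()) return DemoStatus::Failure;
        texts[id] = text;
        return DemoStatus::Success;
    }
};
```

With the SDK, the call would presumably look like add_documents(docInterface, batch), followed by a retry or a log entry for each returned ID.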

🔍 Searching Documents

To perform full-text search, use search_documents(). The function returns relevant documents based on the provided query.

std::string query = "example";
fluffy::InterfaceDocumentQueryResults results;

// Execute search
int top_k = 10;
fluffy::ErrorCode status = docInterface.search_documents(query, top_k, results);

if (status == fluffy::ErrorCode::Success) {
    std::cout << "Full text search with keyword: " << query << ":" << std::endl;
    for (auto& doc : results.results) {
        std::string doc_id = doc.document_id;
        std::string raw_text;
        docInterface.get_text(doc_id, raw_text);

        std::cout << doc_id << "  " << raw_text << std::endl;
    }
} else {
    std::cerr << "Search failed. Error code: " << static_cast<int>(status) << std::endl;
}

get_text() can be used to retrieve the full text of a document specified by its ID.

Results Structure (InterfaceDocumentQueryResults)

  • A list of matching document string IDs, sorted from most to least similar according to full-text (fuzzy) keyword matching.
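
Because the results carry only IDs, a common follow-up is to pair each ID with its text. The sketch below is one possible way to do that and is not an SDK function: search_with_text(), DemoStatus, DemoHit, DemoResults, and DemoStore are hypothetical names. It relies only on what the example above shows, namely a results member that can be iterated, a document_id field on each entry, and get_text(id, out). The stand-in store uses plain substring matching in ID order, not the SDK's fuzzy ranking.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Runs a full-text search and returns (document ID, full text) pairs in result order.
// Results is the result type (fluffy::InterfaceDocumentQueryResults with the SDK).
// An empty vector is returned when the search does not report Success.
template <typename Results, typename Db>
std::vector<std::pair<std::string, std::string>>
search_with_text(Db& db, const std::string& query, int top_k) {
    std::vector<std::pair<std::string, std::string>> hits;
    Results results;
    auto status = db.search_documents(query, top_k, results);
    if (status != decltype(status)::Success) {
        return hits;
    }
    for (const auto& doc : results.results) {
        std::string id = doc.document_id;
        std::string text;
        db.get_text(id, text);  // return value ignored, as in the example above
        hits.emplace_back(id, text);
    }
    return hits;
}

// Hypothetical stand-ins so the sketch compiles without the SDK headers.
enum class DemoStatus { Success, Failure };
struct DemoHit { std::string document_id; };
struct DemoResults { std::vector<DemoHit> results; };

struct DemoStore {
    std::map<std::string, std::string> texts;  // ID -> text, iterated in ID order

    // Substring match only; the real SDK ranks by fuzzy similarity.
    DemoStatus search_documents(const std::string& query, int top_k, DemoResults& out) {
        out.results.clear();
        for (const auto& entry : texts) {
            if (static_cast<int>(out.results.size()) >= top_k) break;
            if (entry.second.find(query) != std::string::npos) {
                out.results.push_back(DemoHit{entry.first});
            }
        }
        return DemoStatus::Success;
    }

    void get_text(const std::string& id, std::string& out) { out = texts.at(id); }
};
```

With the SDK, the equivalent call would presumably be search_with_text<fluffy::InterfaceDocumentQueryResults>(docInterface, query, top_k).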

Attribute Search (e.g., by Title, Author)

To conduct keyword search on a specific attribute (for example, by document title or author), follow the two steps below:

  1. Create an index for the attribute using attach_attribute_index():
docInterface.attach_attribute_index("title");        // create an index for attribute "title"

Once the index is attached to the document interface, every document added afterwards is added to the index automatically.

  2. Use the search_attribute() function to perform a keyword search on the attribute:
    std::string title_keyword = "Animali";
    fluffy::InterfaceDocumentQueryResults attr_results;
    int top_k = 10;
    docInterface.search_attribute(title_keyword, "title", top_k, attr_results);

    std::cout << "Title search with keyword " << title_keyword << ":" << std::endl;
    for (auto& doc : attr_results.results) {
        std::string doc_id = doc.document_id;
        std::string title;
        docInterface.get_document_attribute(doc_id, "title", title);

        std::cout << doc_id << "  " << title << std::endl;
    }

get_document_attribute() can be used to retrieve an attribute from a document specified by its ID.

Results Structure (InterfaceDocumentQueryResults)

  • A list of matching document string IDs, sorted from most to least similar according to attribute (fuzzy) keyword matching.

🗑 Removing a Document

To remove a document, call remove_document().

std::string document_id = "id_to_remove";

// Remove the document
fluffy::ErrorCode status = docInterface.remove_document(document_id);

if (status == fluffy::ErrorCode::Success) {
    std::cout << "Document removed successfully!" << std::endl;
} else {
    std::cerr << "Failed to remove document. Error code: " << static_cast<int>(status) << std::endl;
}

💾 Managing Storage and Performance

Flushing Data to Disk

To ensure that all changes are persisted to disk, call flush():

fluffy::ErrorCode status = docInterface.flush();
if (status == fluffy::ErrorCode::Success) {
    std::cout << "Data successfully flushed to disk." << std::endl;
}
Calling flush() after important writes limits how much data can be lost if the system crashes.
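
One way to flush regularly without doing so on every write is to count writes and flush after every N of them. The class below is a sketch of that idea under stated assumptions: FlushEvery, DemoStatus, and DemoStore are hypothetical names, and the interval is a tuning choice for the application, not an SDK setting. It assumes only that flush() returns a scoped enum with a Success value, as in the example above.

```cpp
#include <cstddef>

// Counts writes and calls flush() on the wrapped store once `interval` writes are pending.
template <typename Db>
class FlushEvery {
public:
    // An interval of 0 is treated as 1, so every write triggers a flush.
    FlushEvery(Db& db, std::size_t interval)
        : db_(db), interval_(interval == 0 ? 1 : interval) {}

    // Call after each successful add_document() or remove_document().
    // Returns true if this call triggered a flush that reported Success.
    bool record_write() {
        if (++pending_ < interval_) return false;
        return flush_now();
    }

    // Flushes immediately; the pending counter is reset only if the flush succeeded,
    // so a failed flush is retried on the next recorded write.
    bool flush_now() {
        auto status = db_.flush();
        if (status != decltype(status)::Success) return false;
        pending_ = 0;
        return true;
    }

    std::size_t pending() const { return pending_; }

private:
    Db& db_;
    std::size_t interval_;
    std::size_t pending_ = 0;
};

// Hypothetical stand-in so the sketch compiles without the SDK headers.
enum class DemoStatus { Success, Failure };

struct DemoStore {
    int flush_calls = 0;
    DemoStatus flush() {
        ++flush_calls;
        return DemoStatus::Success;
    }
};
```

A smaller interval narrows the window of unflushed changes at the cost of more disk writes. Calling flush_now() once more before shutdown covers any writes still pending.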


Optimizing Memory Usage

To reduce memory usage, use offload(), which moves unused data from memory to disk.

fluffy::ErrorCode status = docInterface.offload();
if (status == fluffy::ErrorCode::Success) {
    std::cout << "Memory offloaded successfully." << std::endl;
}

Key Notes:

  • The system will automatically reload necessary data when required.
  • Calling offload() too frequently may impact performance.

Fine-Grained Offloading

If you need more control over offloading, use:

  • offload_data(): Moves document text data to disk.
  • offload_index(): Moves indexing structures to disk.
docInterface.offload_data();  // Offload document text data
docInterface.offload_index(); // Offload indexing structures

Use these selectively when memory is constrained.
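
As an illustration of selective use, the sketch below maps a coarse memory-pressure level to the two calls. MemoryPressure, relieve_memory(), and DemoStore are hypothetical names, and the levels are an application-side convention, not an SDK concept. Offloading text before indices is a guess that searches are more latency-sensitive than text retrieval. Return values are ignored here, as in the snippet above.

```cpp
// Application-defined pressure levels; these are illustrative, not part of the SDK.
enum class MemoryPressure { None, Moderate, High };

// Offloads progressively more as pressure rises.
// Db is any type exposing offload_data() and offload_index() as shown above.
template <typename Db>
void relieve_memory(Db& db, MemoryPressure pressure) {
    switch (pressure) {
        case MemoryPressure::None:
            break;                 // keep everything resident
        case MemoryPressure::Moderate:
            db.offload_data();     // move document text to disk, keep indices in memory
            break;
        case MemoryPressure::High:
            db.offload_data();     // move document text to disk
            db.offload_index();    // and the indexing structures as well
            break;
    }
}

// Hypothetical stand-in that only counts calls, so the sketch compiles without the SDK.
struct DemoStore {
    int data_offloads = 0;
    int index_offloads = 0;
    void offload_data() { ++data_offloads; }
    void offload_index() { ++index_offloads; }
};
```

How the application measures pressure (for example, a resident-memory threshold) is left open here.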


🚀 Best Practices

  • Use flush() regularly to persist changes.
  • Use offload() after indexing and querying to free memory.
  • Insert documents in batches to improve performance.
  • Ensure document IDs are unique to avoid overwriting data.
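
For the last point, one simple client-side guard is to track the IDs the application has already used and check each new ID before calling add_document(). The class below is a sketch of that approach: IdRegistry is a hypothetical helper, not part of the SDK. It only knows about IDs claimed in the current process, so it does not detect IDs already stored in an existing database directory.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>

// Remembers which document IDs the application has already used in this process.
class IdRegistry {
public:
    // Returns true if the ID is new and has now been recorded,
    // false if it was already claimed (the caller should skip or rename the document).
    bool claim(const std::string& id) { return seen_.insert(id).second; }

    // Forgets an ID, for example after remove_document() succeeds.
    void release(const std::string& id) { seen_.erase(id); }

    std::size_t size() const { return seen_.size(); }

private:
    std::unordered_set<std::string> seen_;
};
```

A typical pattern would be to call claim(document_id) first and call add_document() only when it returns true.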