# Understanding Datasets

> Learn about datasets in Artemis Search and how they power intelligent searches

<Tip>
  We provide a limited sample dataset for you to explore. It contains \~19,000 company descriptions. If you cannot find matches for your searches, we recommend uploading a larger dataset.
</Tip>

## What is a Dataset?

A Dataset stores the content that Artemis Search searches over.
A dataset must be a pandas DataFrame exported as a Parquet file, with an `embedding` column containing OpenAI `text-embedding-3-large` embeddings and a `tag` column containing the associated string values.
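
Before uploading, it can help to check a DataFrame against these requirements. The following is a minimal sketch, not part of the Artemis Search API: `validate_dataset` is a hypothetical helper name, and the check that all embeddings share one length is our assumption rather than a documented rule.

```python
import pandas as pd

REQUIRED_COLUMNS = ("embedding", "tag")


def validate_dataset(df: pd.DataFrame) -> list[str]:
    """Return a list of problems found; an empty list means the checks passed."""
    # Both reserved columns must be present before anything else is checked.
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        return [f"missing column: {col}" for col in missing]

    problems = []

    # Tags must be strings.
    if not all(isinstance(value, str) for value in df["tag"]):
        problems.append("tag column contains non-string values")

    # Assumption: every embedding is a sequence of the same length.
    lengths = {len(embedding) for embedding in df["embedding"]}
    if len(lengths) > 1:
        problems.append(f"embeddings have inconsistent lengths: {sorted(lengths)}")

    return problems


# Toy data with made-up 2-dimensional vectors, for illustration only.
good = pd.DataFrame({"embedding": [[0.1, 0.2], [0.3, 0.4]], "tag": ["1", "2"]})
bad = pd.DataFrame({"embedding": [[0.1, 0.2], [0.3]], "tag": [1, 2]})

print(validate_dataset(good))
print(validate_dataset(bad))
```

Returning a list of problems instead of raising on the first one lets you see every issue in a single pass.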

<Note>
  Artemis Search only searches through activated datasets.
</Note>

## Key Concepts

<CardGroup cols={2}>
  <Card title="Embeddings" icon="vector-square">
    Vector representations of text, allowing for semantic similarity comparisons.
  </Card>

  <Card title="Tags" icon="tag">
    Metadata associated with each embedding, returned as content in search results. **This is the only column returned in search results.**
  </Card>

  <Card title="Filter Columns" icon="filter">
    Columns in the dataset that can be used to filter the search results.
  </Card>

  <Card title="Project Association" icon="folder-tree">
    Datasets are linked to specific projects for organized search tasks.
  </Card>
</CardGroup>
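
To make "semantic similarity comparisons" concrete: embeddings are commonly compared with cosine similarity. The sketch below uses toy 3-dimensional vectors and plain Python. It illustrates the general idea only and is not a description of how Artemis Search scores results internally.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings (made-up values).
query = [1.0, 0.0, 1.0]
similar = [0.9, 0.1, 0.8]
unrelated = [0.0, 1.0, 0.0]

# The vector pointing in nearly the same direction scores higher.
print(cosine_similarity(query, similar) > cosine_similarity(query, unrelated))
```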

## Dataset Activation

Within a project, you can have multiple datasets, but only one can be active at a time. The active dataset is the one used for search queries in that project.

<Warning>
  A project must have an active dataset to be operational and allow searches.
</Warning>

## Creating a Dataset

Datasets are simple to prepare but do require a few steps. Because workflows for creating datasets vary, this section walks through one example.

<Tip>
  We are happy to help prepare your dataset for you if you reach out to us at [pallavi@artemisar.com](mailto:pallavi@artemisar.com).
</Tip>

### Background

Suppose we have the following dataset stored in a pandas DataFrame:

| company\_description                                | company\_name       | id |
| --------------------------------------------------- | ------------------- | -- |
| 'Acme is a startup that makes widgets'              | 'Acme'              | 1  |
| 'Wayne Enterprises is a startup that makes widgets' | 'Wayne Enterprises' | 2  |
| 'Parker Industries is a startup that makes widgets' | 'Parker Industries' | 3  |
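
If you want to follow along, the table above can be reproduced as a pandas DataFrame. This is only a convenience sketch, and the variable name `df` is our choice.

```python
import pandas as pd

# The example table from above as a DataFrame.
df = pd.DataFrame(
    {
        "company_description": [
            "Acme is a startup that makes widgets",
            "Wayne Enterprises is a startup that makes widgets",
            "Parker Industries is a startup that makes widgets",
        ],
        "company_name": ["Acme", "Wayne Enterprises", "Parker Industries"],
        "id": [1, 2, 3],
    }
)

print(df.dtypes)
```

Note that `id` is an integer column here, so it has to be converted to strings before it can serve as a tag.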

### Example Workflow

<Steps>
  <Step title="Choosing Source Data">
    Artemis Search datasets consist of embeddings of the text you want to search over as well as associated string tags for each embedding. These tags could represent IDs, names, or any other unique information associated with your embedding.

    In our case, it makes the most sense to embed `company_description` and use `id` as the tag: we want to search over the company descriptions, and the ids uniquely identify each company.
  </Step>

  <Step title="Preparing the Dataset">
    Ultimately, we need a pandas DataFrame exported as a Parquet file that contains an `embedding` column and a `tag` column.

    Given our initial dataframe, we will need to transform the `company_description` column into embeddings and the `id` column into string tags, and then store the result as a parquet file.

    ```python theme={null}
    import os

    import pandas as pd
    from openai import OpenAI

    # Load the initial DataFrame
    df = pd.read_csv('company_descriptions.csv')

    # Create embeddings (the API key is read from the API_KEY environment variable)
    client = OpenAI(api_key=os.environ['API_KEY'])
    text_to_embed = df['company_description'].tolist()
    response = client.embeddings.create(input=text_to_embed, model='text-embedding-3-large')

    # Store one embedding (a list of floats) per row
    df['embedding'] = [item.embedding for item in response.data]
    # Tags must be strings
    df['tag'] = df['id'].astype(str)

    # Save as a Parquet file (requires pyarrow or fastparquet)
    df.to_parquet('data.parquet')
    ```
  </Step>
</Steps>

## Filter Columns

Filter columns are columns in the dataset that can be referenced in the "filter\_query" parameter of a search to filter the search results. Practically, they are any column in the dataset that is not named `embedding` or `tag`.

<Warning>
  These columns cannot be named `embedding` or `tag` since these are reserved column names.
</Warning>

### Example

Consider the following dataset stored in a pandas DataFrame:

| embedding | tag                 | size |
| --------- | ------------------- | ---- |
| \[...]    | 'Acme'              | 1    |
| \[...]    | 'Parker Industries' | 3    |

You can see we have the `embedding` and `tag` columns that we need. However, we also have a `size` column. We call this column a filter column: it is not used for searching directly, but it can be used to filter the search results.
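
The rule that every column except `embedding` and `tag` is a filter column can be sketched in a few lines. `filter_columns` is a hypothetical helper for illustration, not an Artemis Search function. With pandas you could pass it `list(df.columns)`.

```python
RESERVED_COLUMNS = {"embedding", "tag"}


def filter_columns(columns: list[str]) -> list[str]:
    """Return the columns usable for filtering, preserving their order."""
    return [name for name in columns if name not in RESERVED_COLUMNS]


# Column names from the example dataset above.
print(filter_columns(["embedding", "tag", "size"]))  # ['size']
```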

<Tip>
  You can read more about filter queries [here](/search/search-parameters#filter-query).
</Tip>

## Next Steps

Now that you understand the basics of datasets in Artemis Search, learn how to:

<CardGroup cols={2}>
  <Card title="Create a Dataset" icon="upload" href="/datasets/create-a-dataset">
    Create a dataset for your project with your custom data
  </Card>

  <Card title="Manage Datasets" icon="list-check" href="/datasets/manage-datasets">
    View, activate, edit, and delete your datasets
  </Card>
</CardGroup>
