We have provided a limited sample dataset for you to explore. It contains ~19,000 company descriptions. If you are unable to find matches for your searches, we recommend uploading a larger dataset.

What is a Dataset?

A Dataset stores the content that Artemis Search will search over. Datasets must be pandas dataframes exported as a parquet file with an “embedding” column containing OpenAI text-embedding-3-large embeddings and a “tag” column with associated string values.

Artemis Search only searches through activated datasets.

Key Concepts

Embeddings

Vector representations of text, allowing for semantic similarity comparisons.
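To make this concrete, here is a minimal sketch (not part of the Artemis Search API) showing how two OpenAI embeddings can be compared with cosine similarity. It assumes the same embedding model used for datasets and that your OpenAI API key is set in the environment; the example texts are hypothetical.

import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Embed two short texts with the same model used for datasets
resp = client.embeddings.create(
    input=[
        'Acme is a startup that makes widgets',
        'Acme builds small hardware components',
    ],
    model='text-embedding-3-large',
)
a, b = (np.array(d.embedding) for d in resp.data)

# Cosine similarity: values closer to 1 indicate more semantically similar text
similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(similarity)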

Tags

Metadata associated with each embedding. Tags are returned as the content of search results and are the only column returned.

Filter Columns

Columns in the dataset that can be used to filter the search results.

Project Association

Datasets are linked to specific projects for organized search tasks.

Dataset Activation

Within a project, you can have multiple datasets, but only one can be active at a time. The active dataset is the one used for search queries in that project.

At least one dataset must be active for a project to be operational and allow searches.

Creating a Dataset

Datasets are simple to prepare but do require a few steps. Since there are many different ways to create a dataset, we will walk through one example workflow.

We are happy to help prepare your dataset for you if you reach out to us at pallavi@artemisar.com.

Background

Suppose we have the following dataset stored in a Pandas dataframe:

| company_description | company_name | id |
| --- | --- | --- |
| 'Acme is a startup that makes widgets' | 'Acme' | 1 |
| 'Wayne Enterprises is a startup that makes widgets' | 'Wayne Enterprises' | 2 |
| 'Parker Industries is a startup that makes widgets' | 'Parker Industries' | 3 |
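For reference, here is a quick sketch of how this example dataframe could be constructed directly in pandas (the workflow below instead loads it from a CSV file, which is assumed to contain the same columns):

import pandas as pd

# Sample data matching the table above
df = pd.DataFrame({
    'company_description': [
        'Acme is a startup that makes widgets',
        'Wayne Enterprises is a startup that makes widgets',
        'Parker Industries is a startup that makes widgets',
    ],
    'company_name': ['Acme', 'Wayne Enterprises', 'Parker Industries'],
    'id': [1, 2, 3],
})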

Example Workflow

Step 1: Choosing Source Data

Artemis Search datasets consist of embeddings of the text you want to search over as well as associated string tags for each embedding. These tags could represent IDs, names, or any other unique information associated with your embedding.

In our case, it makes the most sense to use company_description as the text we embed and id as the tag, since we want to search over the company descriptions and the ids uniquely identify each company.

Step 2: Preparing the Dataset

Ultimately, we need to end up with a Pandas dataframe exported as a parquet file where one column is ‘embedding’ and the other is ‘tag’.

Given our initial dataframe, we will need to transform the company_description column into embeddings and the id column into string tags, and then store the result as a parquet file.

import os

import numpy as np
import pandas as pd
from openai import OpenAI

# Load the initial dataframe
df = pd.read_csv('company_descriptions.csv')

# Create embeddings for each company description
client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])
text_to_embed = df['company_description'].tolist()
embedding_responses = client.embeddings.create(input=text_to_embed, model='text-embedding-3-large')
embeddings = np.vstack([embedding.embedding for embedding in embedding_responses.data])

# Create the final dataframe with the required 'embedding' and 'tag' columns
df['embedding'] = list(embeddings)
df['tag'] = df['id'].astype(str)

# Save as a parquet file
df.to_parquet('data.parquet')
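As an optional sanity check (a sketch, assuming the file written above), you can read the parquet file back and confirm the required columns are present before uploading:

import pandas as pd

# Verify the exported dataset has the required columns
check = pd.read_parquet('data.parquet')
assert 'embedding' in check.columns and 'tag' in check.columns
print(check[['tag', 'embedding']].head())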

Filter Columns

Filter columns are columns in the dataset that can be referenced in the “filter_query” parameter of a search to filter the search results. Practically, they are any column in the dataset that is not named embedding or tag.

These columns cannot be named embedding or tag since these are reserved column names.

Example

Consider the following dataset stored in a Pandas dataframe:

| embedding | tag | size |
| --- | --- | --- |
| […] | 'Acme' | 1 |
| […] | 'Parker Industries' | 3 |

You can see we have the embedding and tag columns that we need. However, we also have a size column. We would call this column a filter column since it is not used for searching directly but can be used to filter the search results.
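In practice, a filter column is just an extra column kept alongside embedding and tag when you export the dataset. Below is a short sketch continuing the earlier workflow; the size values and output filename are hypothetical:

# Keep an extra column (e.g. company size) alongside 'embedding' and 'tag'
df['size'] = [120, 85, 40]  # hypothetical values, one per company
df[['embedding', 'tag', 'size']].to_parquet('data_with_filters.parquet')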

You can read more about filter queries here.

Next Steps

Now that you understand the basics of datasets in Artemis Search, learn how to: