Understanding Datasets
Learn about datasets in Artemis Search and how they power intelligent searches
We have provided a limited sample dataset for you to explore. It contains ~19,000 company descriptions. If you cannot find matches for your searches, we recommend uploading a larger dataset.
What is a Dataset?
A Dataset stores the content that Artemis Search will search over. A dataset must be a pandas DataFrame exported as a Parquet file, with an “embedding” column containing OpenAI text-embedding-3-large embeddings and a “tag” column containing the associated string values.
Artemis Search only searches through activated datasets.
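The format requirements above can be checked locally before uploading. The following is a minimal sketch, not part of Artemis Search: check_dataset is a hypothetical helper, and the rule that all embeddings share one length is an assumption based on how embedding models normally behave.

```python
import pandas as pd

REQUIRED_COLUMNS = ("embedding", "tag")


def check_dataset(df: pd.DataFrame) -> list:
    """Return a list of problems; an empty list means the frame looks valid."""
    problems = []
    for col in REQUIRED_COLUMNS:
        if col not in df.columns:
            problems.append(f"missing required column: {col}")
    # Tags must be strings, so numeric IDs need converting first.
    if "tag" in df.columns:
        if not df["tag"].map(lambda t: isinstance(t, str)).all():
            problems.append("tag column must contain only strings")
    # Assumption: every embedding has the same number of dimensions.
    if "embedding" in df.columns:
        if df["embedding"].map(len).nunique() > 1:
            problems.append("embeddings must all have the same length")
    return problems


good = pd.DataFrame({"embedding": [[0.1, 0.2], [0.3, 0.4]], "tag": ["1", "2"]})
bad = pd.DataFrame({"embedding": [[0.1, 0.2]], "tag": [1]})

print(check_dataset(good))  # []
print(check_dataset(bad))   # ['tag column must contain only strings']
```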
Key Concepts
Embeddings
Vector representations of text, allowing for semantic similarity comparisons.
Tags
Metadata associated with each embedding. The tag is the only column returned in search results.
Filter Columns
Columns in the dataset that can be used to filter the search results.
Project Association
Datasets are linked to specific projects for organized search tasks.
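To illustrate the semantic similarity comparison mentioned under Embeddings, here is a small sketch using cosine similarity on toy vectors. The source does not say which similarity metric Artemis Search uses, so cosine similarity is an assumption here, and the three-dimensional vectors are placeholders for real embeddings.

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


query = [1.0, 0.0, 1.0]
similar_doc = [2.0, 0.0, 2.0]    # same direction as the query
unrelated_doc = [0.0, 1.0, 0.0]  # orthogonal to the query

print(round(cosine_similarity(query, similar_doc), 6))    # 1.0
print(round(cosine_similarity(query, unrelated_doc), 6))  # 0.0
```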
Dataset Activation
Within a project, you can have multiple datasets, but only one can be active at a time. The active dataset is the one used for search queries in that project.
A project must have an active dataset to be operational and allow searches.
Creating a Dataset
Datasets are simple to prepare but do require a few steps. Because there are many possible workflows for creating a dataset, we present one example workflow here.
We are happy to help prepare your dataset for you if you reach out to us at pallavi@artemisar.com.
Background
Suppose we have the following dataset stored in a Pandas dataframe:
company_description | company_name | id |
---|---|---|
'Acme is a startup that makes widgets' | 'Acme' | 1 |
'Wayne Enterprises is a startup that makes widgets' | 'Wayne Enterprises' | 2 |
'Parker Industries is a startup that makes widgets' | 'Parker Industries' | 3 |
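For readers following along, the table above could be built in pandas roughly like this. The variable name companies is chosen only for illustration.

```python
import pandas as pd

# The example source data from the table above.
companies = pd.DataFrame(
    {
        "company_description": [
            "Acme is a startup that makes widgets",
            "Wayne Enterprises is a startup that makes widgets",
            "Parker Industries is a startup that makes widgets",
        ],
        "company_name": ["Acme", "Wayne Enterprises", "Parker Industries"],
        "id": [1, 2, 3],
    }
)

print(companies)
```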
Example Workflow
Choosing Source Data
Artemis Search datasets consist of embeddings of the text you want to search over as well as associated string tags for each embedding. These tags could represent IDs, names, or any other unique information associated with your embedding.
In our case, it makes the most sense to use company_description as the text we are embedding and id as the tags. We want to search over the company descriptions, and the ids uniquely identify each company.
Preparing the Dataset
Ultimately, we need to end up with a pandas DataFrame exported as a Parquet file where one column is “embedding” and the other is “tag”.
Given our initial dataframe, we need to transform the company_description column into embeddings and the id column into string tags, and then store the result as a Parquet file.
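The transformation described above might look like the sketch below. It is one possible workflow, not the official one. The helper names (build_dataset, openai_embed, fake_embed, export_dataset) are hypothetical, and the model name text-embedding-3-large is assumed to be the OpenAI model this page refers to. The demo uses a placeholder embedder so it runs without an API key; swap in openai_embed for real embeddings.

```python
from typing import Callable, List, Sequence

import pandas as pd

Embedder = Callable[[Sequence[str]], List[List[float]]]


def openai_embed(texts: Sequence[str]) -> List[List[float]]:
    """Embed texts with OpenAI. Needs the openai package and OPENAI_API_KEY."""
    from openai import OpenAI  # imported lazily so the demo runs without it

    client = OpenAI()
    response = client.embeddings.create(
        model="text-embedding-3-large",  # assumed model name
        input=list(texts),
    )
    return [item.embedding for item in response.data]


def fake_embed(texts: Sequence[str]) -> List[List[float]]:
    """Placeholder embedder for the demo only; the vectors are meaningless."""
    return [[float(len(t)), 0.0] for t in texts]


def build_dataset(source: pd.DataFrame, text_col: str, tag_col: str,
                  embed: Embedder) -> pd.DataFrame:
    """Turn a text column into embeddings and an ID column into string tags."""
    embeddings = embed(source[text_col].tolist())
    tags = source[tag_col].astype(str).tolist()
    return pd.DataFrame({"embedding": embeddings, "tag": tags})


def export_dataset(dataset: pd.DataFrame, path: str) -> None:
    """Write the dataset as Parquet (requires pyarrow or fastparquet)."""
    dataset.to_parquet(path, index=False)


companies = pd.DataFrame(
    {
        "company_description": [
            "Acme is a startup that makes widgets",
            "Wayne Enterprises is a startup that makes widgets",
            "Parker Industries is a startup that makes widgets",
        ],
        "id": [1, 2, 3],
    }
)

dataset = build_dataset(companies, "company_description", "id", fake_embed)
print(dataset["tag"].tolist())  # ['1', '2', '3']
```

Calling export_dataset(dataset, "dataset.parquet") then produces the file to upload. Passing the embedder in as a function keeps the embedding step separate from the reshaping step, so the same code works with any embedding source.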
Filter Columns
Filter columns are columns in the dataset that can be referenced in the “filter_query” parameter of a search to filter the search results. In practice, any column that is not named embedding or tag is a filter column. Those two names are reserved, so a filter column cannot use them.
Example
Consider the following dataset stored in a Pandas dataframe:
embedding | tag | size |
---|---|---|
[…] | 'Acme' | 1 |
[…] | 'Parker Industries' | 3 |
This dataset has the embedding and tag columns that we need, plus a size column. We call size a filter column because it is not searched directly but can be used to filter the search results.
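As a quick local illustration of the rule above, the sketch below lists the filter columns of a dataframe by excluding the two reserved names. The helper filter_columns is hypothetical and the short embedding vectors are placeholders. The syntax of “filter_query” itself is not covered here.

```python
import pandas as pd

RESERVED_COLUMNS = {"embedding", "tag"}


def filter_columns(df: pd.DataFrame) -> list:
    """Every column that is not reserved can be used as a filter column."""
    return [col for col in df.columns if col not in RESERVED_COLUMNS]


dataset = pd.DataFrame(
    {
        "embedding": [[0.1, 0.2], [0.3, 0.4]],  # placeholder vectors
        "tag": ["Acme", "Parker Industries"],
        "size": [1, 3],
    }
)

print(filter_columns(dataset))  # ['size']
```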
You can read more about filter queries here.
Next Steps
Now that you understand the basics of datasets in Artemis Search, learn how to: