Doc/replace segments with docs (#27)

* feat: replace segments with chunks

* feat: replace Dataset/Datasets with Knowledge

* feat: replace dataset with knowledge

* feat: replace datasets with knowledge

* Update en/advanced/datasets/README.md

Co-authored-by: Joshua <138381132+joshua20231026@users.noreply.github.com>

* Update en/advanced/datasets/README.md

Co-authored-by: Joshua <138381132+joshua20231026@users.noreply.github.com>

* Update en/advanced/datasets/README.md

Co-authored-by: Joshua <138381132+joshua20231026@users.noreply.github.com>

* Update en/advanced/datasets/README.md

Co-authored-by: Joshua <138381132+joshua20231026@users.noreply.github.com>

* Update en/advanced/datasets/README.md

Co-authored-by: Joshua <138381132+joshua20231026@users.noreply.github.com>

* Update en/advanced/retrieval-augment/retrieval.md

Co-authored-by: Joshua <138381132+joshua20231026@users.noreply.github.com>

* Update en/advanced/datasets/README.md

Co-authored-by: Joshua <138381132+joshua20231026@users.noreply.github.com>

* Update en/advanced/datasets/README.md

Co-authored-by: Joshua <138381132+joshua20231026@users.noreply.github.com>

* Update en/application/prompt-engineering/conversation-application.md

Co-authored-by: Joshua <138381132+joshua20231026@users.noreply.github.com>

* Update en/advanced/datasets/README.md

Co-authored-by: Joshua <138381132+joshua20231026@users.noreply.github.com>

* Update en/advanced/datasets/README.md

Co-authored-by: Joshua <138381132+joshua20231026@users.noreply.github.com>

* Update en/advanced/datasets/README.md

Co-authored-by: Joshua <138381132+joshua20231026@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Joshua <138381132+joshua20231026@users.noreply.github.com>

* Update en/advanced/datasets/maintain-dataset-via-api.md

Co-authored-by: Joshua <138381132+joshua20231026@users.noreply.github.com>

* Update en/advanced/datasets/sync-from-notion.md

Co-authored-by: Joshua <138381132+joshua20231026@users.noreply.github.com>

* Update en/application/prompt-engineering/text-generation-application.md

Co-authored-by: Joshua <138381132+joshua20231026@users.noreply.github.com>

* Update en/use-cases/create-an-ai-chatbot-with-business-data-in-minutes.md

Co-authored-by: Joshua <138381132+joshua20231026@users.noreply.github.com>

* Update en/getting-started/faq/llms-use-faq.md

Co-authored-by: Joshua <138381132+joshua20231026@users.noreply.github.com>

* Update en/advanced/retrieval-augment/hybrid-search.md

Co-authored-by: Joshua <138381132+joshua20231026@users.noreply.github.com>

* Update en/advanced/retrieval-augment/hybrid-search.md

Co-authored-by: Joshua <138381132+joshua20231026@users.noreply.github.com>

* Update en/advanced/retrieval-augment/rerank.md

Co-authored-by: Joshua <138381132+joshua20231026@users.noreply.github.com>

* Update en/advanced/retrieval-augment/retrieval.md

Co-authored-by: Joshua <138381132+joshua20231026@users.noreply.github.com>

* Update en/advanced/retrieval-augment/retrieval.md

Co-authored-by: Joshua <138381132+joshua20231026@users.noreply.github.com>

* Update en/advanced/retrieval-augment/retrieval.md

Co-authored-by: Joshua <138381132+joshua20231026@users.noreply.github.com>

* Update en/advanced/retrieval-augment/retrieval.md

Co-authored-by: Joshua <138381132+joshua20231026@users.noreply.github.com>

* Update en/advanced/retrieval-augment/retrieval.md

Co-authored-by: Joshua <138381132+joshua20231026@users.noreply.github.com>

---------

Co-authored-by: Joshua <138381132+joshua20231026@users.noreply.github.com>
pull/29/head
crazywoola 2023-12-11 14:36:18 +08:00 committed by GitHub
parent 2ef8b970a5
commit 6bdf387722
28 changed files with 110 additions and 106 deletions


@ -46,9 +46,9 @@
* [Hybrid Search](advanced/retrieval-augment/hybrid-search.md)
* [Rerank](advanced/retrieval-augment/rerank.md)
* [Retrieval](advanced/retrieval-augment/retrieval.md)
* [Datasets\&Index](advanced/datasets/README.md)
* [Knowledge\&Index](advanced/datasets/README.md)
* [Sync from Notion](advanced/datasets/sync-from-notion.md)
* [Maintain Datasets Via Api](advanced/datasets/maintain-dataset-via-api.md)
* [Maintain Knowledge via API](advanced/datasets/maintain-dataset-via-api.md)
* [Plugins](advanced/ai-plugins/README.md)
* [Based on WebApp Template](advanced/ai-plugins/based-on-frontend-templates.md)
* [Model Configuration](advanced/model-configuration/README.md)


@ -1,13 +1,13 @@
# Datasets\&Index
# Knowledge\&Index
Most language models use outdated training data and have length limitations for the context of each request. For example, GPT-3.5 is trained on corpora from 2021 and has a limit of approximately 4k tokens per request. This means that developers who want their AI applications to be based on the latest and private context conversations must use techniques like embedding.
Dify' dataset feature allows developers (and even non-technical users) to easily manage datasets and automatically integrate them into AI applications. All you need to do is prepare text content, such as:
Dify's knowledge feature allows developers (and even non-technical users) to easily manage knowledge bases and automatically integrate them into AI applications. All you need to do is prepare text content, such as:
* Long text content (TXT, Markdown, DOCX, HTML, JSONL, or even PDF files)
* Structured data (CSV, Excel, etc.)
Additionally, we are gradually supporting syncing data from various data sources to datasets, including:
Additionally, we are gradually adding support for syncing data from various data sources to knowledge bases, including:
* GitHub
* Databases
@ -15,12 +15,12 @@ Additionally, we are gradually supporting syncing data from various data sources
* ...
{% hint style="info" %}
**Practice**: If your company wants to build an AI customer service assistant based on existing knowledge bases and product documentation, you can upload the documents to a dataset in Dify and create a conversational application. This might have taken you several weeks in the past and been difficult to maintain continuously.
**Practice**: If your company wants to build an AI customer service assistant based on existing knowledge bases and product documentation, you can upload the documents to a knowledge base in Dify and create a conversational application. This might have taken you several weeks in the past and been difficult to maintain continuously.
{% endhint %}
### Datasets and Documents
### Knowledge and Documents
In Dify, datasets (Datasets) are collections of documents (Documents). A dataset can be integrated as a whole into an application to be used as context. Documents can be uploaded by developers or operations staff, or synced from other data sources (typically corresponding to a file unit in the data source).
In Dify, knowledge bases are collections of documents. A knowledge base can be integrated as a whole into an application to be used as context. Documents can be uploaded by developers or operations staff, or synced from other data sources (typically corresponding to a file unit in the data source).
**Steps to upload a document:**
@ -30,19 +30,19 @@ In Dify, datasets (Datasets) are collections of documents (Documents). A dataset
4. Set metadata for the document
5. Ready to use in the application!
#### How to write a good dataset description
#### How to write a good knowledge description
When multiple datasets are referenced in an application, AI uses the description of the datasets and the user's question to determine which dataset to use to answer the user's question. Therefore, a well-written dataset description can improve the accuracy of AI in selecting datasets.
When multiple knowledge bases are referenced in an application, the AI uses the descriptions of the knowledge bases and the user's question to determine which knowledge base to use to answer. Therefore, a well-written knowledge base description can improve the accuracy with which the AI selects a knowledge base.
The key to writing a good dataset description is to clearly describe the content and characteristics of the dataset. **It is recommended that the dataset description begin with this: `Useful only when the question you want to answer is about the following: specific description`**. Here is an example of a real estate dataset description:
The key to writing a good knowledge base description is to clearly describe the content and characteristics of the knowledge base. **It is recommended that the knowledge base description begin with this: `Useful only when the question you want to answer is about the following: specific description`**. Here is an example description for a real estate knowledge base:
> Useful only when the question you want to answer is about the following: global real estate market data from 2010 to 2020. This data includes information such as the average housing price, property sales volume, and housing types for each city. In addition, this dataset also includes some economic indicators such as GDP and unemployment rate, as well as some social indicators such as population and education level. These indicators can help analyze the trends and influencing factors of the real estate market. With this data, we can understand the development trends of the global real estate market, analyze the changes in housing prices in various cities, and understand the impact of economic and social factors on the real estate market.
> Useful only when the question you want to answer is about the following: global real estate market data from 2010 to 2020. This data includes information such as the average housing price, property sales volume, and housing types for each city. In addition, this knowledge base also includes some economic indicators such as GDP and unemployment rate, as well as some social indicators such as population and education level. These indicators can help analyze the trends and influencing factors of the real estate market. With this data, we can understand the development trends of the global real estate market, analyze the changes in housing prices in various cities, and understand the impact of economic and social factors on the real estate market.
### Create a dataset
### Create a Knowledge Base
1. Click on datasets in the main navigation bar of Dify. On this page, you can see the existing datasets. Click on "Create Dataset" to enter the creation wizard.
1. Click on Knowledge in the main navigation bar of Dify. On this page, you can see the existing knowledge bases. Click "Create Knowledge" to enter the creation wizard.
2. If you have already prepared your files, you can start by uploading the files.
3. If you haven't prepared your documents yet, you can create an empty dataset first.
3. If you haven't prepared your documents yet, you can create an empty knowledge base first.
### Uploading Documents by File Upload
@ -55,7 +55,7 @@ The key to writing a good dataset description is to clearly describe the content
Text preprocessing and cleaning refers to Dify automatically segmenting and vectorizing your documents so that users' questions (input) can match relevant paragraphs (Q to P) and generate results.
When uploading a dataset, you need to select a **indexing mode** to specify how data is matched. This affects the accuracy of AI replies.
When uploading to a knowledge base, you need to select an **indexing mode** to specify how data is matched. This affects the accuracy of AI replies.
In **High Quality mode**, OpenAI's embedding API is used for higher accuracy in user queries.
@ -76,7 +76,7 @@ Modify Documents For technical reasons, if developers make the following changes
1. Adjust segmentation and cleaning settings
2. Re-upload the file
Dify support customizing the segmented and cleaned text by adding, deleting, and editing paragraphs. You can dynamically adjust your segmentation to make your dataset more accurate. Click **Document --> paragraph --> Edit** in the dataset to modify paragraphs content and custom keywords. Click **Document --> paragraph --> Add segment --> Add a segment** to manually add new paragraph. Or click **Document --> paragraph --> Add segment --> Batch add** to batch add new paragraph.
Dify supports customizing the segmented and cleaned text by adding, deleting, and editing paragraphs. You can dynamically adjust your segmentation to make your knowledge base more accurate. Click **Document --> paragraph --> Edit** in the knowledge base to modify paragraph content and custom keywords. Click **Document --> paragraph --> Add segment --> Add a segment** to manually add a new paragraph, or click **Document --> paragraph --> Add segment --> Batch add** to add new paragraphs in batches.
<figure><img src="../../.gitbook/assets/image (3) (1).png" alt=""><figcaption><p>Edit</p></figcaption></figure>
@ -84,30 +84,30 @@ Dify support customizing the segmented and cleaned text by adding, deleting, and
### Disabling and Archiving of Documents
* **Disable, cancel disable**: The dataset supports disabling documents or segments that you temporarily do not want indexed. In the dataset's document list, click the Disable button and the document will be disabled. You can also click the Disable button in the document details to disable the entire document or a segment. Disabled documents will not be indexed. To cancel the disable, click Enable on a disabled document.
* **Archive, Unarchive:** Some unused old document data can be archived if you don't want to delete it. After archiving, the data can only be viewed or deleted, not edited. In the document list of the dataset, click the Archive button to archive the document. Documents can also be archived in the document details page. Archived documents will not be indexed. Archived documents can also be unarchived by clicking the Unarchive button.
* **Disable, cancel disable**: A knowledge base supports disabling documents or chunks that you temporarily do not want indexed. In the knowledge base's document list, click the Disable button to disable a document. You can also click the Disable button in the document details to disable the entire document or a single chunk. Disabled documents will not be indexed. To re-enable, click Enable on a disabled document.
* **Archive, Unarchive:** Old document data that is no longer used can be archived if you don't want to delete it. After archiving, the data can only be viewed or deleted, not edited. In the knowledge base's document list, click the Archive button to archive the document. Documents can also be archived on the document details page. Archived documents will not be indexed. Archived documents can be unarchived by clicking the Unarchive button.
### Maintain Datasets via API
### Maintain Knowledge via API
TODO
### Dataset Settings
### Knowledge Settings
Click **Settings** in the left navigation of the dataset. You can change the following settings for the dataset:
Click **Settings** in the left navigation of the knowledge base. You can change the following settings for the knowledge base:
* Dataset **name** for identifying a dataset
* Dataset **description** to allow AI to better use the dataset appropriately. If the description is empty, Dify's automatic indexing strategy will be used.
* **Permissions** can be set to Only Me or All Team Members. Those without permissions cannot view and edit the dataset.
* Knowledge base **name**, for identifying a knowledge base
* Knowledge base **description**, to allow the AI to better use the knowledge base appropriately. If the description is empty, Dify's automatic indexing strategy will be used.
* **Permissions** can be set to Only Me or All Team Members. Those without permission cannot view or edit the knowledge base.
* **Indexing mode**: In High Quality mode, OpenAI's embedding interface will be called for processing, providing higher accuracy when users query. In Economic mode, offline vector engines, keyword indexing, etc. are used, which reduces accuracy but consumes no tokens.
Note: Upgrading the indexing mode from Economic to High Quality will incur additional token consumption. Downgrading from High Quality to Economic will not consume tokens.
### Integrate into Applications
Once the dataset is ready, it needs to be integrated into the application. When the AI application processes will automatically use the associated dataset content as a reference context.
Once the knowledge base is ready, it needs to be integrated into the application. The AI application will then automatically use the associated knowledge base content as reference context.
1. Go to the application - Prompt Arrangement page
2. In the context options, select the dataset you want to integrate
2. In the context options, select the knowledge base you want to integrate
3. Save the settings to complete the integration
### Q\&A
@ -116,11 +116,11 @@ Once the dataset is ready, it needs to be integrated into the application. When
A: If your PDF parsing appears garbled under certain formatted contents, you could consider converting the PDF to Markdown format, which currently offers higher accuracy, or you could reduce the use of images, tables, and other formatted content in the PDF. We are researching ways to optimize the experience of using PDFs.
**Q: How does the consumption mechanism of context work?** A: With a dataset added, each query will consume segmented content (currently embedding two segments) + question + prompt + chat history combined. However, it will not exceed model limitations, such as 4096.
**Q: How does the consumption mechanism of context work?** A: Once a knowledge base is added, each query consumes the combined total of the segmented content (currently the two embedded chunks), the question, the prompt, and the chat history. However, it will not exceed model limitations, such as 4,096 tokens.
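The consumption mechanism described in this answer can be sketched as a simple token budget. This is an illustrative sketch, not Dify's actual implementation: `count_tokens` here is a hypothetical whitespace-based counter, whereas a real implementation would use the model's tokenizer.

```python
# Sketch of the context consumption mechanism: embedded chunks + question
# + prompt + chat history, kept under the model limit by dropping the
# oldest history turns first. Token counting is a whitespace approximation.

MODEL_LIMIT = 4096  # e.g. GPT-3.5's approximate context window

def count_tokens(text: str) -> int:
    # Hypothetical counter; a real implementation would use the model's tokenizer.
    return len(text.split())

def build_request(chunks, question, prompt, history, limit=MODEL_LIMIT):
    """Combine chunks + question + prompt + history within the token limit."""
    fixed = (count_tokens(prompt) + count_tokens(question)
             + sum(count_tokens(c) for c in chunks))
    budget = limit - fixed
    kept = []
    for turn in reversed(history):  # most recent turns are kept first
        cost = count_tokens(turn)
        if cost > budget:
            break  # older turns are dropped
        kept.append(turn)
        budget -= cost
    return {"context": chunks, "question": question, "prompt": prompt,
            "history": list(reversed(kept))}
```

With a tight limit, the oldest history turn is dropped while the embedded chunks, question, and prompt are always included.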
**Q: Where does the embedded dataset appear when asking questions?** A: It will be embedded as context before the question.
**Q: Where does the embedded knowledge appear when asking questions?** A: It will be embedded as context before the question.
**Q: Is there any priority between the added dataset and OpenAI's answers?** A: The dataset serves as context and is used together with questions for LLM to understand and answer; there is no priority relationship.
**Q: Is there any priority between the added knowledge and OpenAI's answers?** A: The knowledge base serves as context and is used together with the question for the LLM to understand and answer; there is no priority relationship.
**Q: Why can I hit in test but not in application?** A: You can troubleshoot issues by following these steps:
@ -131,4 +131,4 @@ A: If your PDF parsing appears garbled under certain formatted contents, you cou
**Q: Will APIs related to hit testing be opened up so that dify can access knowledge bases and implement dialogue generation using custom models?** A: We plan to open up Webhooks later on; however, there are no current plans for this feature. You can achieve your requirements by connecting to any vector database.
**Q: How do I add multiple datasets?** A: Due to short-term performance considerations, we currently only support one dataset. If you have multiple sets of data, you can upload them within the same dataset for use.
**Q: How do I add multiple knowledge bases?** A: Due to short-term performance considerations, we currently only support one knowledge base. If you have multiple sets of data, you can upload them within the same knowledge base for use.


@ -1,20 +1,20 @@
# Maintain Datasets via API
# Maintain Knowledge via API
> Authentication, invocation method and application Service API remain consistent. The difference is that a dataset API token can operate on all datasets.
> Authentication, invocation method and application Service API remain consistent. The difference is that a knowledge API token can operate on all knowledge bases.
### Benefits of Using the Dataset API
* Sync your data systems to Dify datasets to create powerful workflows.
* Provide dataset list and document list APIs as well as detail query interfaces, to facilitate building your own data management page.
### Benefits of Using the Knowledge API
* Sync your data systems to Dify knowledge bases to create powerful workflows.
* Provide knowledge base list and document list APIs, as well as detail query interfaces, to facilitate building your own data management page.
* Support both plain text and file uploads/updates documents, as well as batch additions and modifications, to simplify your sync process.
* Reduce manual document handling and syncing time, improving visibility of Dify's software and services.
### How to use
Please go to the dataset page, you can switch tap to the API page in the navigation on the left side. On this page, you can view the API documentation provided by Dify and manage credentials for accessing the Dataset API.
Go to the knowledge base page, where you can switch to the API page via the navigation on the left side. On this page, you can view the API documentation provided by Dify and manage credentials for accessing the Knowledge API.
<figure><img src="../../.gitbook/assets/dataset-api-token.png" alt=""><figcaption><p>Dataset API Document</p></figcaption></figure>
<figure><img src="../../.gitbook/assets/dataset-api-token.png" alt=""><figcaption><p>Knowledge API Document</p></figcaption></figure>
## **Create Empty Dataset**
## **Create an Empty Knowledge Base**
**`POST /datasets`**
@ -30,7 +30,7 @@ curl --location --request POST 'https://api.dify.ai/v1/datasets' \
```
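The curl example above can be reproduced with Python's standard library. A minimal sketch, assuming the endpoint and Bearer-token header shown in the curl example; `YOUR_API_KEY` and the knowledge base name `product-docs` are placeholders.

```python
# Sketch of the "create an empty knowledge base" call, equivalent to the
# curl example above. Builds the request without sending it, so credentials
# and payload can be inspected first.
import json
import urllib.request

API_BASE = "https://api.dify.ai/v1"

def build_create_request(name: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST /datasets request."""
    payload = json.dumps({"name": name}).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/datasets",
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_create_request("product-docs", "YOUR_API_KEY")
# urllib.request.urlopen(req)  # uncomment to actually send the request
```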
#### **List of Datasets**
#### **List of Knowledge Bases**
```
@ -143,9 +143,9 @@ curl 'https://api.dify.ai/v1/datasets/aac47674-31a8-4f12-aab2-9603964c4789/docum
- `document_indexing`: document is in indexing status
- `provider_not_initialize`: embedding model is not configured
- `not_found`: document does not exist
- `dataset_name_duplicate` have existing dataset name
- `dataset_name_duplicate`: the knowledge base name already exists
- `provider_quota_exceeded`: the model quota has exceeded the limit
- `dataset_not_initialized`The dataset has not been initialized
- `dataset_not_initialized`: the knowledge base has not been initialized
- `unsupported_file_type`: unsupported file type
- supported file types: txt, markdown, md, pdf, html, htm, xlsx, docx, csv
- `too_many_files`: too many files; only single-file upload is currently supported
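A client can map the error codes above to actionable messages. A minimal sketch, assuming the API returns a JSON body with a `code` field (the exact response shape is an assumption, not confirmed by this page):

```python
# Sketch: mapping the Knowledge API error codes listed above to readable
# messages. The {"code": ...} response shape is assumed for illustration.

ERROR_MESSAGES = {
    "document_indexing": "Document is still being indexed; retry later.",
    "provider_not_initialize": "Embedding model is not configured.",
    "not_found": "Document does not exist.",
    "dataset_name_duplicate": "A knowledge base with this name already exists.",
    "provider_quota_exceeded": "The model quota has exceeded the limit.",
    "dataset_not_initialized": "The knowledge base has not been initialized.",
    "unsupported_file_type": "Unsupported file type (use txt, markdown, md, "
                             "pdf, html, htm, xlsx, docx, or csv).",
    "too_many_files": "Only single-file upload is currently supported.",
}

def describe_error(response: dict) -> str:
    """Return a readable message for an error response body."""
    code = response.get("code", "")
    return ERROR_MESSAGES.get(code, f"Unknown error: {code}")
```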


@ -1,17 +1,17 @@
# Sync from Notion
Dify dataset supports importing from Notion and setting up **Sync** so that data is automatically synced to Dify after updates in Notion.
Dify knowledge bases support importing from Notion and setting up **Sync**, so that data is automatically synced to Dify after it is updated in Notion.
### Authorization verification
1. When creating a dataset, select the data source, click **Sync from Notion--Go to connect**, and complete the authorization verification according to the prompt.
1. When creating a knowledge base, select the data source, click **Sync from Notion--Go to connect**, and complete the authorization verification according to the prompt.
2. Alternatively, click **Settings--Data Sources--Add a Data Source**, then click **Connect** on the Notion source to complete authorization verification.
<figure><img src="../../.gitbook/assets/notion-connect.png" alt=""><figcaption><p>Connect Notion</p></figcaption></figure>
### Import Notion data
After completing authorization verification, go to the dataset creation page, click **Sync from Notion**, and select the required authorization page to import.
After completing authorization verification, go to the knowledge base creation page, click **Sync from Notion**, and select the required authorized pages to import.
### Segmentation and cleaning
@ -21,7 +21,7 @@ _**Note: Images and files are not currently supported for import. Table data wil
### Sync Notion data
If your Notion content has been modified, you can click Sync directly on the Dify dataset document list page to sync the data with one click(Please note that each time you click, the current content will be synchronized). This step requires token consumption.
If your Notion content has been modified, you can click Sync directly on the Dify knowledge base document list page to sync the data with one click (note that each click synchronizes the current content). This step consumes tokens.
<figure><img src="../../.gitbook/assets/sync-notion-data.png" alt=""><figcaption><p>Sync Notion data</p></figcaption></figure>
@ -73,4 +73,4 @@ Back to the Dify source code , in the **.env** file configuration related enviro
**NOTION\_CLIENT\_ID**=you-client-id
Once configured, you will be able to utilize Notion data import and sync functions in the dataset section.
Once configured, you will be able to use Notion data import and sync functions in the knowledge base section.


@ -8,7 +8,7 @@ Please read [.](./ "mention") to complete the development and integration of bas
`app.moderation.input`: End-user input content review extension point. It is used to review the content of variables passed in by end-users and the input content of dialogues in conversational applications.
`app.moderation.output`: LLM output content review extension point. It is used to review the content output by LLM. When the LLM output is streaming, the content will be requested by the API in segments of 100 characters to avoid delays in review when the output content is lengthy.
`app.moderation.output`: LLM output content review extension point. It is used to review the content output by LLM. When the LLM output is streaming, the content will be requested by the API in chunks of 100 characters to avoid delays in review when the output content is lengthy.
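The chunked review behavior described above can be sketched as a small streaming buffer. This is an illustrative sketch, not Dify's implementation; `moderate` is a hypothetical callback standing in for the extension-point API request.

```python
# Sketch of chunked output review: a streaming LLM output is buffered and a
# (hypothetical) moderation callback is invoked per 100 characters, so review
# of long outputs is not delayed until the stream ends.

CHUNK_SIZE = 100

def review_stream(token_stream, moderate):
    """Yield tokens through unchanged, calling moderate() per 100 characters."""
    buffer = ""
    for token in token_stream:
        buffer += token
        while len(buffer) >= CHUNK_SIZE:
            moderate(buffer[:CHUNK_SIZE])  # review a full 100-char chunk
            buffer = buffer[CHUNK_SIZE:]
        yield token
    if buffer:
        moderate(buffer)  # flush the final partial chunk
```

Every character passes through moderation exactly once, in order, while the caller still receives the original token stream.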
### `app.moderation.input`


@ -23,6 +23,10 @@ Dify classifies models into 4 types, each for different uses:
> Provider: OpenAI.
1. System Reasoning Model. In the created application, this type of model is used. Smart chat, dialogue name generation, and next question suggestions also use reasoning models.
2. Embedding Model. In knowledge bases, this type of model is used to embed segmented documents. In applications that use knowledge bases, it is also used to embed the user's questions.
3. Speech-to-Text model. In conversational applications, this type of model is used to convert speech to text.
Dify plans to add more LLM providers as technology and user needs evolve.
## Hosted Model Trial Service&#x20;


@ -38,7 +38,7 @@ Well, before you try the new mode, you should be aware of some essential element
<img src="../../.gitbook/assets/Context.png" alt="" data-size="line">
When users input a query, the app processes the query as search criteria for the dataset. The organized results from the search then replace the variable `Context`, allowing the LLM to reference the content for its response.
When users input a query, the app processes the query as search criteria for the knowledge base. The organized results from the search then replace the variable `Context`, allowing the LLM to reference the content for its response.
@ -93,7 +93,7 @@ It is used to filter the text fragments with the highest similarity to the user'
**Score Threshold:** The value is a floating-point number from 0 to 1, with two decimal places.
It is used to set the similarity threshold for text segment selection, i.e., it only recalls text segments that exceed the set score. By default, the system turns this setting off, meaning there's no filtering based on the similarity value of the recalled text segments. When activated, the default value is 0.7. We recommend keeping this setting deactivated by default. If you have more stringent reply requirements, you can set a higher value, though it's not advisable to set it excessively high.
It is used to set the similarity threshold for text chunk selection, i.e., only text chunks that exceed the set score are recalled. By default, the system turns this setting off, meaning there is no filtering based on the similarity value of the recalled text chunks. When activated, the default value is 0.7. We recommend keeping this setting deactivated by default. If you have more stringent reply requirements, you can set a higher value, though it is not advisable to set it excessively high.
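The interaction of TopK and Score Threshold described above can be sketched as a two-stage filter. An illustrative sketch, not Dify's internal implementation; similarity scores are assumed to lie in [0, 1].

```python
# Sketch of TopK + Score Threshold recall: optionally drop chunks at or
# below the threshold (the threshold is off by default), then keep the
# TopK highest-scoring chunks.

def recall_chunks(scored_chunks, top_k=3, score_threshold=None):
    """scored_chunks: list of (chunk_text, similarity_score) pairs."""
    if score_threshold is not None:  # None models the default "off" state
        scored_chunks = [(c, s) for c, s in scored_chunks
                         if s > score_threshold]
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]
```

For example, with a 0.7 threshold and TopK of 2, only the chunks scoring above 0.7 survive, and the two best of those are recalled.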
### 3. **Stop\_Sequences**


@ -143,7 +143,7 @@ Dify has collaborated with some model providers for joint deep optimization of s
### **Parameter Definitions**&#x20;
* **Context**: Used to insert related text from the dataset as context into the complete prompts.&#x20;
* **Context**: Used to insert related text from the knowledge as context into the complete prompts.&#x20;
* **Pre-prompt**: Pre-prompts arranged in the **Basic Mode** are inserted into the complete prompts.&#x20;
* **History**: When building a chat application using text generation models, the system inserts the user's conversation history as context into the complete prompts. Since some models may respond differently to role prefixes, you can also modify the role prefix name in the conversation history settings, for example, changing the name "Assistant" to "AI".
* **Query**: The query content represents variable values used to insert questions that users input during the chat.


@ -37,23 +37,23 @@ Different search systems each excel at uncovering various subtle connections wit
## Vector Search&#x20;
Definition: Vector Search involves generating query embeddings and then searching for text segments that most closely match these embeddings in terms of vector representation.
Definition: Vector Search involves generating query embeddings and then searching for text chunks that most closely match these embeddings in terms of vector representation.
<figure><img src="../../.gitbook/assets/screenshot-20231119-174228.png" alt=""><figcaption><p>Settings for Vector Search</p></figcaption></figure>
**TopK:** This setting is used to filter text segments that have the highest similarity to the user's query. The system also dynamically adjusts the number of segments based on the context window size of the selected model. The default value for this setting is 3.
**TopK:** This setting is used to filter text chunks that have the highest similarity to the user's query. The system also dynamically adjusts the number of chunks based on the context window size of the selected model. The default value for this setting is 3.
**Score Threshold:** This setting is used to establish a similarity threshold for the selection of text segments. It means that only text segments exceeding the set score are recalled. By default, this setting is turned off, meaning that the system does not filter the similarity values of the recalled text segments. When activated, the default value is set to 0.5.
**Score Threshold:** This setting is used to establish a similarity threshold for the selection of text chunks. It means that only text chunks exceeding the set score are recalled. By default, this setting is turned off, meaning that the system does not filter the similarity values of the recalled text chunks. When activated, the default value is set to 0.5.
**Rerank Model:** After configuring the Rerank model's API key on the "Model Provider" page, you can enable the "Rerank Model" in the search settings. The system then performs a semantic re-ranking of the document results that have been recalled after semantic search, optimizing the order of these results. Once the Rerank model is set up, the TopK and Score threshold settings are only effective in the Rerank step.
## Full-Text Search&#x20;
Definition: Full-Text Search involves indexing all the words in a document, enabling users to query any term and retrieve text segments that contain these terms.
Definition: Full-Text Search involves indexing all the words in a document, enabling users to query any term and retrieve text chunks that contain these terms.
<figure><img src="../../.gitbook/assets/screenshot-20231119-174610.png" alt=""><figcaption><p>Settings for Full-Text Search</p></figcaption></figure>
**TopK:** This setting is utilized to select text segments that most closely match the user's query in terms of similarity. The system also dynamically adjusts the number of segments based on the context window size of the chosen model. The default value for TopK is set at 3.
**TopK:** This setting is utilized to select text chunks that most closely match the user's query in terms of similarity. The system also dynamically adjusts the number of chunks based on the context window size of the chosen model. The default value for TopK is set at 3.
**Rerank Model:** After configuring the API key for the Rerank model on the "Model Provider" page, you can activate the "Rerank Model" in the search settings. The system will then perform a semantic re-ranking of the document results retrieved through full-text search, optimizing the order of these results. Once the Rerank model is configured, the TopK and any Score threshold settings will only be effective during the Rerank step.
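Indexing every word, as described above, can be sketched with a toy inverted index. This is illustrative only, not the search engine Dify actually uses:

```python
from collections import defaultdict

chunks = {
    1: "full-text search indexes every word in the document",
    2: "vector search compares embeddings rather than exact terms",
}

# Inverted index: term -> ids of the chunks containing it.
index = defaultdict(set)
for chunk_id, text in chunks.items():
    for term in text.lower().split():
        index[term].add(chunk_id)

def search(term):
    """Return ids of chunks containing the exact term."""
    return sorted(index.get(term.lower(), set()))

search("word")    # only chunk 1 contains the exact term "word"
search("search")  # both chunks contain "search"
```

Because matching is term-based, a semantic rerank step afterwards is what restores meaning-aware ordering.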
@ -63,18 +63,18 @@ Hybrid Search operates by concurrently executing Full-Text Search and Vector Sea
<figure><img src="../../.gitbook/assets/screenshot-20231119-175216.png" alt=""><figcaption><p>Settings for Hybrid Search</p></figcaption></figure>
**TopK:** This setting is used for filtering text chunks that have the highest similarity to the user's query. The system will dynamically adjust the number of chunks based on the context window size of the model in use. The default value for TopK is set at 3.
**Rerank Model:** After configuring the Rerank model's API key on the "Model Supplier" page, you can enable the "Rerank Model" in the search settings. The system will perform a semantic re-ranking of the document results retrieved through hybrid search, thereby optimizing the order of these results. Once the Rerank model is set up, the TopK and any Score threshold settings are only applicable during the Rerank step.
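Merging the two recall paths and reranking once can be sketched like this. The score fusion and the `relevance` callable are illustrative stand-ins for the real vector store and Rerank model:

```python
def hybrid_merge(vector_hits, keyword_hits):
    """Union the two result sets, keeping the best normalized score per chunk id.

    vector_hits / keyword_hits: dicts of chunk_id -> score in [0, 1].
    """
    merged = dict(vector_hits)
    for chunk_id, score in keyword_hits.items():
        merged[chunk_id] = max(merged.get(chunk_id, 0.0), score)
    return merged

def rerank(merged, relevance, top_k=3):
    """Reorder merged candidates with a stand-in rerank scoring function."""
    return sorted(merged, key=relevance, reverse=True)[:top_k]

vector_hits = {"a": 0.91, "b": 0.62}   # from semantic (vector) search
keyword_hits = {"b": 0.88, "c": 0.55}  # from full-text search
merged = hybrid_merge(vector_hits, keyword_hits)  # {"a": 0.91, "b": 0.88, "c": 0.55}
```

In the real system the final ordering comes from the Rerank model's relevance scores, not from the fused recall scores.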
## Setting the Search Mode When Creating a Knowledge&#x20;
To set the search mode when creating a knowledge base, navigate to the "Knowledge -> Create Knowledge" page. There, you can configure different search modes in the retrieval settings section.
<figure><img src="../../.gitbook/assets/screenshot-20231119-175958.png" alt=""><figcaption><p>Setting the Search Mode When Creating a Knowledge base</p></figcaption></figure>
## Modifying the Search Mode in Prompt Engineering
You can modify the search mode during application creation by navigating to the "Prompt Engineering -> Context -> Select Knowledge -> Settings" page. This allows for adjustments to different search modes within the prompt arrangement phase.
<figure><img src="../../.gitbook/assets/screenshot-20231119-182704.png" alt=""><figcaption><p>Modifying the Search Mode in Prompt Engineering</p></figcaption></figure>


@ -12,9 +12,9 @@ In most cases, there is an initial search before rerank because calculating the
However, rerank is not only applicable to merging results from different search systems. Even in a single search mode, introducing a rerank step can effectively improve the recall of documents, such as adding semantic rerank after keyword search.
In practice, apart from normalizing results from multiple queries, we usually limit the number of text chunks passed to the large model before providing the relevant text chunks (i.e., TopK, which can be set in the rerank model parameters). This is done because the input window of the large model has size limitations (generally 4K, 8K, 16K, 128K Token counts), and you need to select an appropriate segmentation strategy and TopK value based on the size limitation of the chosen model's input window.
It should be noted that even if the model's context window is sufficiently large, too many recalled chunks may introduce content with lower relevance, thus degrading the quality of the answer. Therefore, the TopK parameter for rerank is not necessarily better when larger.
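The window-size arithmetic above can be made concrete with a small sketch. The function and its parameters are illustrative, not a Dify API:

```python
def max_chunks_for_window(context_window, prompt_tokens, answer_tokens, chunk_tokens):
    """Rough upper bound on TopK given the model's context window.

    All arguments are token counts; chunk_tokens is the approximate size of one chunk.
    """
    budget = context_window - prompt_tokens - answer_tokens
    return max(budget // chunk_tokens, 0)

# e.g. an 8K-token model, a 500-token prompt, 1,000 tokens reserved for the
# answer, and ~500-token chunks leave room for at most 13 chunks -- though, as
# noted above, a smaller TopK often produces better answers.
max_chunks_for_window(8192, 500, 1000, 500)  # -> 13
```

The bound is a ceiling, not a recommendation: relevance, not window capacity, should drive the TopK you actually set.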
Rerank is not a substitute for search technology but an auxiliary tool to enhance existing search systems. **Its greatest advantage is that it not only offers a simple and low-complexity method to improve search results but also allows users to integrate semantic relevance into existing search systems without the need for significant infrastructure modifications.**
@ -22,11 +22,11 @@ Rerank is not a substitute for search technology but an auxiliary tool to enhanc
Visit [https://cohere.com/rerank](https://cohere.com/rerank), register on the page, and apply for usage rights for the Rerank model to obtain the API key.
## Setting the Rerank Model in Knowledge Search Mode&#x20;
Access the Rerank settings by navigating to “Knowledge -> Create Knowledge -> Retrieval Settings”. Besides setting Rerank during knowledge creation, you can also modify the Rerank configuration in the settings of an already created knowledge base, and change the Rerank configuration in the knowledge recall mode settings of application arrangement.
<figure><img src="../../.gitbook/assets/screenshot-20231119-191016.png" alt=""><figcaption><p>Setting the Rerank Model in Knowledge Search Mode </p></figcaption></figure>
**TopK:** Used to set the number of relevant documents returned after Rerank.&#x20;


@ -1,6 +1,6 @@
# Retrieval
When users build knowledge base Q\&A AI applications, if multiple knowledge bases are associated within the application, Dify supports two retrieval modes: N-to-1 retrieval and Multi-path retrieval.
<figure><img src="../../.gitbook/assets/screenshot-20231119-191531.png" alt=""><figcaption><p>Retrieval Settings</p></figcaption></figure>
@ -8,21 +8,21 @@ When users build knowledge base Q\&A AI applications, if multiple datasets are a
### **N-to-1 Retrieval**&#x20;
Based on user intent and the knowledge base descriptions, the Agent independently determines and selects the single best-matching knowledge base for querying relevant text. This mode is suitable for applications with clearly distinct knowledge bases and a smaller number of them. N-to-1 retrieval relies on the model's inference capability to choose the most relevant knowledge base based on user intent. When inferring which knowledge base to use, each knowledge base serves as a tool for the Agent, chosen through intent inference; the tool description is essentially the knowledge base description.
When users upload documents, the system automatically creates a summary description of each knowledge base. To achieve the best retrieval results in this mode, view the system-generated summary under “Knowledge -> Settings -> Knowledge Description” and check whether it clearly summarizes the knowledge base's content.
Here is the technical flowchart for N-to-1 retrieval:
<figure><img src="../../.gitbook/assets/spaces_CdDIVDY6AtAz028MFT4d_uploads_LgAOVtxy9kQ0B8e2qaQl_image.webp" alt=""><figcaption><p>N-to-1 Retrieval </p></figcaption></figure>
Therefore, this mode's recall effectiveness can be impacted when there are too many knowledge bases or when the knowledge descriptions lack sufficient distinction. This mode is more suitable for applications with fewer knowledge bases.&#x20;
Tip: OpenAI Function Call already supports multiple tool calls, and Dify plans to upgrade this mode to "N-to-M retrieval" in future versions.
### Multi-path Retrieval
Based on user intent, this mode matches all knowledge bases simultaneously, queries relevant text chunks from multiple knowledge bases, and after a re-ranking step, selects the best results matching the user's question from the multi-path query results. Configuring the Rerank model API is required. In Multi-path retrieval mode, the search engine retrieves text content related to the user's query from all knowledge bases associated with the application, merges the results from multi-path recall, and re-ranks the retrieved documents semantically using the Rerank model.
In Multi-path retrieval mode, configuring the Rerank model is necessary. How to configure the Rerank model: 🔗
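The multi-path flow, query every knowledge base, merge, then rerank once, can be sketched as follows. `kb.search` and `rerank_model` are hypothetical stand-ins for the configured components, not Dify APIs:

```python
def multi_path_retrieve(query, knowledge_bases, rerank_model, top_k=3):
    """Query every knowledge base, merge the candidates, rerank once.

    knowledge_bases: objects with a search(query) -> [(chunk, score)] method.
    rerank_model: callable scoring (query, chunk) -> relevance.
    """
    candidates = []
    for kb in knowledge_bases:
        candidates.extend(chunk for chunk, _ in kb.search(query))
    # Deduplicate while preserving first-seen order, then one semantic rerank
    # over the merged candidate pool.
    unique = list(dict.fromkeys(candidates))
    unique.sort(key=lambda chunk: rerank_model(query, chunk), reverse=True)
    return unique[:top_k]
```

Because the final ordering comes from the Rerank model rather than from per-knowledge-base scores, results from different knowledge bases become directly comparable.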
@ -30,4 +30,4 @@ Here is the technical flowchart for Multi-path retrieval:&#x20;
<figure><img src="../../.gitbook/assets/spaces_CdDIVDY6AtAz028MFT4d_uploads_xfMNnsyD506TOoynHdgU_image.webp" alt=""><figcaption><p>Multi-path retrieval</p></figcaption></figure>
As Multi-path retrieval does not rely on the model's inferencing capability or knowledge descriptions, this mode can achieve higher quality recall results in multi-knowledge searches. Additionally, incorporating the Rerank step can effectively improve document recall. Therefore, when creating a knowledge base Q\&A application associated with multiple knowledge bases, we recommend configuring the retrieval mode as Multi-path retrieval.


@ -14,7 +14,7 @@ You can choose one or all of them to support your AI application development.
Dify offers two types of applications: text generation and conversational. More application paradigms may appear in the future, and the ultimate goal of Dify is to cover more than 80% of typical LLM application scenarios. The differences between text generation and conversational applications are shown in the table below:
<table><thead><tr><th width="199.33333333333331"> </th><th>Text Generation</th><th>Conversational</th></tr></thead><tbody><tr><td>WebApp Interface</td><td>Form + Results</td><td>Chat style</td></tr><tr><td>API Endpoint</td><td><code>completion-messages</code></td><td><code>chat-messages</code></td></tr><tr><td>Interaction Mode</td><td>One question and one answer</td><td>Multi-turn dialogue</td></tr><tr><td>Streaming results return</td><td>Supported</td><td>Supported</td></tr><tr><td>Context Preservation</td><td>Current request only</td><td>Continuous</td></tr><tr><td>User input form</td><td>Supported</td><td>Supported</td></tr><tr><td>Knowledge&#x26;Plugins</td><td>Supported</td><td>Supported</td></tr><tr><td>AI opening remarks</td><td>Not supported</td><td>Supported</td></tr><tr><td>Scenario example</td><td>Translation, judgment, indexing</td><td>Chat or everything</td></tr></tbody></table>
### Steps to Create an Application
@ -26,7 +26,7 @@ We provide some templates in the application creation interface, and you can cli
### Creating from a Configuration File
If you have obtained a template from the community or someone else, you can click to create from an application configuration file. Uploading the file will load most of the settings from the other party's application (but not the knowledge at present).
### Your Application


@ -9,7 +9,7 @@ Dify offers a "Backend-as-a-Service" API, providing numerous benefits to AI appl
* Well-encapsulated original LLM APIs
* Effortlessly switch between LLM providers and centrally manage API keys
* Operate applications visually, including log analysis, annotation, and user activity observation
* Continuously provide more tools, plugins, and knowledge
### How to use
@ -17,7 +17,7 @@ Choose an application, and find the API Access in the left-side navigation of th
<figure><img src="../.gitbook/assets/API Access.png" alt=""><figcaption><p>API document</p></figcaption></figure>
You can create multiple access credentials for an application to deliver to different users or developers. This means that API users can use the AI capabilities provided by the application developer, but the underlying Prompt engineering, knowledge, and tool capabilities are encapsulated.
{% hint style="warning" %}
In best practices, API keys should be called through the backend, rather than being directly exposed in plaintext within frontend code or requests. This helps prevent your application from being abused or attacked.
{% endhint %}


@ -48,7 +48,7 @@ And then edit the opening remarks:
**2.2 Adding Context**
If an application wants to generate content based on private contextual conversations, it can use our [knowledge](../../advanced/datasets/) feature. Click the "Add" button in the context to add a knowledge base.
![](<../../.gitbook/assets/image (9).png>)


@ -1,6 +1,6 @@
# External-data-tool
Previously, [knowledge](../../advanced/datasets/ "mention") allowed developers to directly upload long texts in various formats and structured data to build knowledge bases, enabling AI applications to converse based on the latest context uploaded by users. With this update, the external data tool empowers developers to use their own search capabilities or external data such as internal knowledge bases as the context for LLMs. This is achieved by extending APIs to fetch external data and embedding it into Prompts. Compared to uploading knowledge to the cloud, using external data tools offers significant advantages in ensuring the security of private data, customizing searches, and obtaining real-time data.
## What does it do?


@ -36,7 +36,7 @@ The prompt we are filling in here is: `Translate the content to: {{language}}. T
**2.2 Adding Context**
If the application wants to generate content based on private contextual conversations, our [knowledge](../../advanced/datasets/) feature can be used. Click the "Add" button in the context to add a knowledge base.
![](<../../.gitbook/assets/image (12).png>)


@ -6,7 +6,7 @@ When we talk to large natural language models, we often encounter situations whe
<figure><img src="../.gitbook/assets/image (61).png" alt=""><figcaption></figcaption></figure>
Chat supports the use of plugins and knowledge.
### Use plugins
@ -42,16 +42,16 @@ Configured entry:
<figure><img src="../.gitbook/assets/image (18).png" alt=""><figcaption></figcaption></figure>
### Use knowledge
Chat supports knowledge bases. After a knowledge base is selected, questions the user asks about its content will be answered by the model using that knowledge base.
We can select the knowledge bases needed for this conversation before it starts.
<figure><img src="../.gitbook/assets/image (5).png" alt=""><figcaption></figcaption></figure>
### The process of thinking
The thinking process refers to the process of the model using plugins and knowledge. We can see the thought process in each answer.
<figure><img src="../.gitbook/assets/image (23).png" alt=""><figcaption></figcaption></figure>


@ -133,7 +133,7 @@ The database, configured storage, and vector database data need to be backed up.
`127.0.0.1` is the container's internal address; the server address configured in Dify must be the host machine's LAN IP address.
### 11. How to solve the size and quantity limitations for uploading knowledge documents in the local deployment version
You can refer to the official website environment variable description document to configure:&#x20;


@ -10,7 +10,7 @@
Because in natural language processing, longer text outputs usually require longer computation time and more computing resources. Therefore, limiting the length of the output text can reduce the computational cost and time to some extent. For example, set: max\_tokens=500, which means that only the first 500 tokens of the output text are considered, and the part exceeding this length will be discarded. The purpose of doing so is to ensure that the length of the output text does not exceed the acceptable range of the LLM, while making full use of computing resources to improve the efficiency of the model. On the other hand, more often limiting max\_tokens can increase the length of the prompt, such as the limit of gpt-3.5-turbo is 4097 tokens, if you set max\_tokens=4000, then only 97 tokens are left for the prompt, and an error will be reported if exceeded.
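The prompt-budget arithmetic above is simple enough to sketch directly; `prompt_budget` is an illustrative helper, not an API:

```python
def prompt_budget(context_limit, max_tokens):
    """Tokens left for the prompt once max_tokens is reserved for the output."""
    return context_limit - max_tokens

# gpt-3.5-turbo's window is 4097 tokens; reserving 4000 for the output
# leaves only 97 for the prompt, as described above.
prompt_budget(4097, 4000)  # -> 97
```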
### 3. How to split long text data in a knowledge base reasonably?
In some natural language processing applications, text is often split into paragraphs or sentences for better processing and understanding of semantic and structural information in the text. The minimum splitting unit depends on the specific task and technical implementation. For example:
@ -20,7 +20,7 @@ In some natural language processing applications, text is often split into parag
Finally, experiments and evaluations are still needed to determine the most suitable embedding technology and splitting unit. The performance of different technologies and splitting units can be compared on the test set to select the optimal scheme.
### 4. What distance function do we use when retrieving knowledge chunks?
We use [cosine similarity](https://en.wikipedia.org/wiki/Cosine\_similarity). The choice of distance function is usually irrelevant. OpenAI embeddings are normalized to length 1, which means:
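for unit-length vectors the denominator is 1, so cosine similarity reduces to a plain dot product. A quick illustrative check:

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# OpenAI embeddings are normalized to length 1, so cosine similarity
# equals the dot product of the two vectors.
a, b = [0.6, 0.8], [0.8, 0.6]           # both unit-length
dot = sum(x * y for x, y in zip(a, b))  # 0.96
assert abs(cosine_similarity(a, b) - dot) < 1e-9
```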
@ -76,7 +76,7 @@ You can lower the value of "Max token" in the parameter settings of the Prompt E
A: The default models can be configured under **Settings - Model Provider.** Currently supported text generation LLMs include OpenAI, Azure OpenAl, Anthropic, etc. At the same time, open-source LLMs hosted on Hugging Face, Replicate, xinference, etc. can also be integrated.
### 11. A knowledge base in Community Edition gets stuck in "Queued" when Q\&A segmentation mode is enabled.
Please check if the rate limit has been reached for the Embedding model API key used.
@ -87,7 +87,7 @@ There are two potential solutions if the error "Invalid token" appears:
* Clear the browser cache (cookies, session storage, and local storage) or the app cache on mobile. Then, revisit the app.
* Regenerate the app URL and access the app again with the new URL. This should resolve the "Invalid token" error.
### 13. What are the size limits for uploading knowledge documents?
The maximum size for a single document upload is currently 15MB. There is also a limit of 100 total documents. These limits can be adjusted if you are using a local deployment. Refer to the [documentation](install-faq.md#11.-how-to-solve-the-size-and-quantity-limitations-for-uploading-dataset-documents-in-the-local-depl) for details on changing the limits.
@ -95,11 +95,11 @@ The maximum size for a single document upload is currently 15MB. There is also a
The Claude model does not have its own embedding model. Therefore, the embedding process and other dialog generation like next question suggestions default to using OpenAI keys. This means OpenAI credits are still consumed. You can set different default inference and embedding models under **Settings > Model Provider.**
### 15. Is there any way to make the model rely more on knowledge base data than on its own generation capabilities?
Whether a knowledge base is used depends on its description, so write the knowledge base description as clearly as possible. Please refer to the [documentation](https://docs.dify.ai/advanced/datasets) for details.
### 16. How to better segment an uploaded Excel knowledge document?
Set the headers in the first row and put the content in each subsequent row. Do not add extra header rows or complex formatted table content.
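The recommended layout, headers in row 1, one record per row, maps each row to one self-contained chunk. A sketch using CSV data as a stand-in for the spreadsheet (the column names are illustrative):

```python
import csv
import io

# A well-formed sheet: headers in row 1, one record per subsequent row,
# no merged cells or extra header blocks.
sheet = io.StringIO(
    "question,answer\n"
    "How do I reset my password?,Use the reset link on the login page.\n"
    "What is the refund window?,30 days after purchase.\n"
)

# Each data row becomes one self-contained chunk for indexing.
chunks = [
    f"question: {row['question']} | answer: {row['answer']}"
    for row in csv.DictReader(sheet)
]
assert len(chunks) == 2
```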


@ -259,7 +259,7 @@ Used to store uploaded data set files, team/tenant encryption keys, and other fi
Whether Milvus uses an SSL connection; the default is false.
#### Knowledge Configuration
* UPLOAD_FILE_SIZE_LIMIT:


@ -37,5 +37,5 @@ With the introduction of an LLMOps platform like Dify, the process of developing
Additionally, Dify will provide AI plugin development and integration features, enabling developers to easily create and deploy LLM-based plugins for various applications, further enhancing development efficiency and application value.
**Dify** is an easy-to-use LLMOps platform designed to empower more people to create sustainable, AI-native applications. With visual orchestration for various application types, Dify offers out-of-the-box, ready-to-use applications that can also serve as Backend-as-a-Service APIs. Unify your development process with one API for plugins and knowledge integration, and streamline your operations using a single interface for prompt engineering, visual analytics, and continuous improvement.


@ -36,11 +36,11 @@ Click [here](https://dify.ai/) to login to Dify. You can conveniently log in usi
#### 2. Create a new knowledge base
Click the `Knowledge` button on the top side bar, followed by the `Create Knowledge` button.
![login-2](https://pan.wsyfin.com/f/G6ziA/login-2.png)
#### 3. Connect with Notion and Your Knowledge Base
Select "Sync from Notion" and then click the "Connect" button.
@ -74,7 +74,7 @@ Enjoy your coffee while waiting for the training process to complete.
#### 5. Create Your AI application <a href="#5-create-your-own-ai-application" id="5-create-your-own-ai-application"></a>
You must create an AI application and link it to the knowledge base you just created.
Return to the dashboard, and click the "Create new APP" button. It's recommended to use the Chat App directly.


@ -14,9 +14,9 @@ Dify provides free message call usage quotas for OpenAI GPT series (200 times) a
### Upload your product documentation or knowledge base.
If you want to build an AI Chatbot based on the company's existing knowledge base and product documents, you need to upload as many product-related documents as possible to a Dify knowledge base. Dify helps you **complete segmentation and cleaning of the data.** Dify knowledge bases support two indexing modes: high quality and economical. We recommend using the high quality mode, which consumes tokens but provides higher accuracy.
1. Create a new knowledge base
2. Upload your business data (batch uploading of multiple texts is supported)
3. Select the cleaning method
4. Click \[Save and Process], and it will take only a few seconds to complete the processing.
@ -28,7 +28,7 @@ If you want to build an AI Chatbot based on the company's existing knowledge bas
Create a conversational app on the \[Build App] page. Then start setting up the prompt and its front-end user experience interactions.
1. Give the AI instructions: Click "Pre Prompt" on the left to edit your Prompt so that the application plays the role of customer service when communicating with users. You can specify its tone and style, and restrict which questions it will or will not answer.
2. Let AI possess your business knowledge: add the target knowledge you just uploaded in the \[context].
3. Set up the opening remarks: click "Add Feature" to turn this on. It adds an opening line so that when users open the customer service window, the assistant greets them first, making the experience friendlier.
4. Set up the "Next Question Suggestion": turn on this feature to "Add Feature". The purpose is to give users a direction for their next question after they have asked one.
5. Choose a suitable model and adjust the parameters: different models can be selected in the upper right corner of the page. The performance and token price consumed by different models are different. In this example, we use the GPT3.5 model.
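Once the app is published, it can also be called over HTTP rather than through the web UI. A sketch of a single chat request, assuming the `/chat-messages` endpoint and fields from Dify's app API docs; `{api_key}` is the app's API key and the `user` value is any stable end-user identifier you choose:

```shell
# Send one question to the conversational app and wait for the full answer
curl -X POST 'https://api.dify.ai/v1/chat-messages' \
--header 'Authorization: Bearer {api_key}' \
--header 'Content-Type: application/json' \
--data-raw '{
  "inputs": {},
  "query": "What file formats does the product support?",
  "response_mode": "blocking",
  "user": "end-user-123"
}'
```

Switching `response_mode` to `streaming` returns the answer incrementally instead of in one blocking response.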

### Citations and Attributions
If the "Citations and Attributions" feature is enabled when configuring the application, dialogue responses will automatically show the sources of the quoted knowledge documents.
<figure><img src="../.gitbook/assets/image (3).png" alt=""><figcaption></figcaption></figure>

### Knowledge and Documents
In Dify, **Knowledge** is a collection of **Documents**. A knowledge base can be integrated as a whole into an application and used as context. Documents can be uploaded by developers or operations staff, or synced from other data sources (each document typically corresponds to one file unit in the source).
**Steps to upload a document:**

Go to the knowledge page, where you can switch to the **API** page in the left navigation. There you can view the knowledge API documentation provided by Dify and manage the credentials for accessing the knowledge API under **API Keys**.
<figure><img src="../../.gitbook/assets/dataset-api-token.png" alt=""><figcaption><p>Knowledge API Document</p></figcaption></figure>
### API Call Examples
curl 'https://api.dify.ai/v1/datasets/aac47674-31a8-4f12-aab2-9603964c4789/documents/2034e0c1-1b75-4532-849e-24e72666595b/segment' \
--header 'Authorization: Bearer {api_key}' \
--header 'Content-Type: application/json' \
--data-raw '{
  "chunks": [
    {
      "content": "Dify means Do it for you",
      "keywords": ["Dify", "Do"]
    }
  ]
}'

## Feature Introduction
Previously, the [knowledge](../../advanced/datasets/ "mention") feature allowed developers to upload long texts and structured data in various formats to build a knowledge base, so that AI applications could hold conversations based on the latest context uploaded by users.
The **external data tool** introduced in this update enables developers to use external data, such as their own search capability or an internal knowledge base, as context for the LLM; the external data is fetched through an API extension and embedded into the prompt. Compared with uploading a knowledge base to the cloud, the **external data tool** has significant advantages in keeping private data secure, customizing search, and obtaining real-time data.

#### 2. Create a new knowledge base <a href="#2-create-a-new-datasets" id="2-create-a-new-datasets"></a>
Click the "Knowledge" button in the top sidebar, then click the "Create Knowledge" button.
![login-2](https://pan.wsyfin.com/f/G6ziA/login-2.png)