Langchainloader
flowtask.components.LangchainLoader
LangchainLoader
Bases: FlowComponent
LangchainLoader.
Overview:
Takes a list of documents and converts them into Langchain Documents.
Example:
```yaml
LangchainLoader:
  path: /home/ubuntu/symbits/lg/bot/products_positive
  source_type: Product-Top-Reviews
  loader: HTMLLoader
  chunk_size: 2048
  elements:
    - div: .product
```
loaders
abstract
AbstractLoader
AbstractLoader(tokenizer=None, text_splitter=None, summarizer=None, markdown_splitter=None, source_type='file', doctype='document', device=None, cuda_number=0, llm=None, **kwargs)
Bases: ABC
Abstract class for Document loaders.
load (async)
Load data from a source and return it as a list of Langchain Documents.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Union[str, PurePath, List[PurePath]] | The source of the data. | required |

Returns:
| Type | Description |
|---|---|
| List[Document] | A list of Langchain Documents. |
resolve_paths
Resolve the input path into a list of file paths. Handles lists, directories, glob patterns, and single file paths.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Union[str, PurePath, List[PurePath]] | Input path(s). | required |

Returns:
| Type | Description |
|---|---|
| List[Path] | A list of resolved file paths. |
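The four input kinds that `resolve_paths` is described as handling can be illustrated with a minimal standard-library sketch. The function name `resolve_paths_sketch`, the non-recursive directory listing, and the sorted output are assumptions made for illustration; the actual method may behave differently on each point.

```python
import glob
from pathlib import Path, PurePath
from typing import List, Union

PathInput = Union[str, PurePath, List[PurePath]]


def resolve_paths_sketch(path: PathInput) -> List[Path]:
    """Expand a list, directory, glob pattern, or single file into file paths."""
    # A list is resolved item by item and flattened into one result.
    if isinstance(path, (list, tuple)):
        resolved: List[Path] = []
        for item in path:
            resolved.extend(resolve_paths_sketch(item))
        return resolved

    text = str(path)

    # Glob metacharacters mean the input is a pattern, not a literal path.
    if any(ch in text for ch in "*?["):
        matches = (Path(p) for p in glob.glob(text, recursive=True))
        return sorted(p for p in matches if p.is_file())

    candidate = Path(text)

    # Assumption: a directory yields its direct child files only.
    if candidate.is_dir():
        return sorted(p for p in candidate.iterdir() if p.is_file())

    # Anything else is treated as a single file path (existence is not checked).
    return [candidate]
```

Under these assumptions, `resolve_paths_sketch("reports/*.pdf")` would return only the matching files, while passing the `reports` directory itself would return every file directly inside it.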
basepdf
BasePDF
Bases: AbstractLoader
Base Abstract loader for all PDF-file Loaders.
load (async)
Load data from a source and return it as a list of Langchain Documents.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Union[str, PurePath, List[PurePath]] | The source of the data. | required |

Returns:
| Type | Description |
|---|---|
| List[Document] | A list of Langchain Documents. |
docx
MSWordLoader
MSWordLoader(tokenizer=None, text_splitter=None, summarizer=None, markdown_splitter=None, source_type='file', doctype='document', device=None, cuda_number=0, llm=None, **kwargs)
Bases: AbstractLoader
Load Microsoft Docx as Langchain Documents.
extract_text
Extract text from a docx file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | The source of the data. | required |

Returns:
| Type | Description |
|---|---|
| str | The extracted text. |
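For readers unfamiliar with the format, a .docx file is a ZIP archive whose main body lives in `word/document.xml`. The sketch below pulls paragraph text out using only the standard library. It is an illustration of the idea, not the way `MSWordLoader.extract_text` is known to work; the function name `extract_docx_text` is hypothetical, and tables, headers, and footnotes are not treated specially.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(path) -> str:
    """Return the paragraph text of a .docx file, one paragraph per line."""
    # `path` may be a filesystem path or a binary file-like object.
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for paragraph in root.iter(f"{W}p"):
        # A paragraph is split into runs; join the text of every <w:t> node.
        text = "".join(node.text or "" for node in paragraph.iter(f"{W}t"))
        if text:  # skip empty paragraphs
            paragraphs.append(text)
    return "\n".join(paragraphs)
```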
html
HTMLLoader
Bases: AbstractLoader
Loader for HTML files to convert into Langchain Documents.
Processes HTML files, extracts relevant content, converts to Markdown, and associates metadata with each document.
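The content-extraction step can be pictured with a small standard-library sketch that collects the text of every element matching a tag and CSS class, such as the `div` / `.product` pair in the YAML example. This covers only extraction, not the Markdown conversion or metadata handling, and it is an assumption about how an `elements` entry might be interpreted. The names `ClassTextExtractor` and `extract_elements` are hypothetical.

```python
from html.parser import HTMLParser
from typing import List


class ClassTextExtractor(HTMLParser):
    """Collect the text inside every <tag> element carrying a given class."""

    def __init__(self, tag: str, css_class: str) -> None:
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0                      # > 0 while inside a matching element
        self.chunks: List[List[str]] = []   # one text buffer per matched element

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nested elements of the same tag so the end tag is paired correctly.
            if tag == self.tag:
                self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and self.css_class in classes:
            self.depth = 1
            self.chunks.append([])

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks[-1].append(data.strip())


def extract_elements(html: str, tag: str = "div", css_class: str = "product") -> List[str]:
    """Return one text string per matching element, in document order."""
    parser = ClassTextExtractor(tag, css_class)
    parser.feed(html)
    parser.close()
    return [" ".join(parts) for parts in parser.chunks]
```

Each returned string could then become the content of one document, with the file path and `source_type` attached as metadata.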
pdfblocks
PDFBlocks
Bases: BasePDF
Load a PDF Table as Blocks of text.
get_markdown
Convert a DataFrame to a Markdown string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame to convert. | required |

Returns:
| Type | Description |
|---|---|
| str | The Markdown string. |
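To make the conversion concrete, here is a dependency-free sketch that renders a header and rows as a Markdown table. It is not the method's actual implementation; the helper name `rows_to_markdown` and the escaping rules are assumptions. With pandas available, `DataFrame.to_markdown()` (which needs the `tabulate` package) is the usual shortcut.

```python
from typing import Sequence


def rows_to_markdown(header: Sequence[object], rows: Sequence[Sequence[object]]) -> str:
    """Render a header and data rows as a pipe-delimited Markdown table."""

    def cell(value: object) -> str:
        # Escape pipes and flatten newlines so a value cannot break the table.
        return str(value).replace("|", "\\|").replace("\n", " ")

    lines = [
        "| " + " | ".join(cell(h) for h in header) + " |",
        "|" + "|".join("---" for _ in header) + "|",
    ]
    for row in rows:
        lines.append("| " + " | ".join(cell(v) for v in row) + " |")
    return "\n".join(lines)
```

For a DataFrame `df`, the same helper could be called as `rows_to_markdown(list(df.columns), df.values.tolist())`.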
unique_columns
Rename duplicate columns in the DataFrame to ensure they are unique.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame with potential duplicate column names. | required |

Returns:
| Type | Description |
|---|---|
| DataFrame | A DataFrame with unique column names. |
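Tables extracted from PDFs often repeat a header, which is why column names need deduplicating before further processing. The sketch below shows one way to do it on a plain list of names. The `_1`, `_2` suffix scheme and the helper name `unique_names` are assumptions; the actual method may rename duplicates differently.

```python
from typing import Dict, List, Set


def unique_names(columns: List[str]) -> List[str]:
    """Append a numeric suffix to repeated names so every name is unique."""
    used: Set[str] = set()
    counts: Dict[str, int] = {}
    result: List[str] = []
    for name in columns:
        candidate = name
        # Keep bumping the suffix until the candidate has not been used yet.
        while candidate in used:
            counts[name] = counts.get(name, 0) + 1
            candidate = f"{name}_{counts[name]}"
        used.add(candidate)
        result.append(candidate)
    return result
```

Applied to a DataFrame, this would look like `df.columns = unique_names(list(df.columns))`.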
pdfmark
pdftables
PDFTables
Bases: BasePDF
Loader for tables found in PDF files.
get_markdown
Convert a DataFrame to a Markdown string.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame to convert. | required |

Returns:
| Type | Description |
|---|---|
| str | The Markdown string. |
unique_columns
Rename duplicate columns in the DataFrame to ensure they are unique.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| df | DataFrame | The DataFrame with potential duplicate column names. | required |

Returns:
| Type | Description |
|---|---|
| DataFrame | A DataFrame with unique column names. |
qa
QAFileLoader
Bases: AbstractLoader
Question-and-answer file based on Excel, converted to Langchain Documents.
load (async)
Load data from a source and return it as a list of Langchain Documents.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | Path | The source of the data. | required |

Returns:
| Type | Description |
|---|---|
| List[Document] | A list of Langchain Documents. |