Langchainloader

flowtask.components.LangchainLoader

LangchainLoader

LangchainLoader(loop=None, job=None, stat=None, **kwargs)

Bases: FlowComponent

LangchainLoader.

Overview:

Gets a list of documents and converts them into Langchain Documents.

Example:

```yaml
LangchainLoader:
    path: /home/ubuntu/symbits/lg/bot/products_positive
    source_type: Product-Top-Reviews
    loader: HTMLLoader
    chunk_size: 2048
    elements:
    - div: .product
```

get_default_llm

get_default_llm()

Return a VertexLLM instance.

loaders

abstract

AbstractLoader
AbstractLoader(tokenizer=None, text_splitter=None, summarizer=None, markdown_splitter=None, source_type='file', doctype='document', device=None, cuda_number=0, llm=None, **kwargs)

Bases: ABC

Abstract class for Document loaders.

get_default_llm
get_default_llm()

Return a VertexLLM instance.

get_summary_from_text
get_summary_from_text(text, use_gpu=False)

Get a summary of a text.

load async
load(path)

Load data from a source and return it as a list of Langchain Documents.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | Union[str, PurePath, List[PurePath]] | The source of the data. | required |

Returns:

| Type | Description |
| --- | --- |
| List[Document] | A list of Langchain Documents. |
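
Because `load` is marked async, it has to be awaited from a coroutine or driven by an event loop. The sketch below only illustrates that calling pattern. `EchoLoader` and the `Document` dataclass are hypothetical stand-ins written for this example; they are not part of flowtask or Langchain, and a real loader would read and split the file instead of echoing its name.

```python
import asyncio
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List


@dataclass
class Document:
    """Stand-in for a Langchain Document (hypothetical, illustration only)."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


class EchoLoader:
    """Hypothetical loader exposing the same async ``load(path)`` shape."""

    async def load(self, path) -> List[Document]:
        # A real loader would parse the file; this one only records the path.
        return [
            Document(
                page_content=f"content of {Path(path).name}",
                metadata={"source": str(path)},
            )
        ]


async def main() -> List[Document]:
    loader = EchoLoader()
    # ``load`` is a coroutine function, so its result must be awaited.
    return await loader.load("reviews/page1.html")


# From synchronous code, asyncio.run() drives the coroutine to completion.
docs = asyncio.run(main())
```

Inside an already running event loop (for example within another coroutine), `await loader.load(path)` would be used directly instead of `asyncio.run`.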

resolve_paths
resolve_paths(path)

Resolve the input path into a list of file paths. Handles lists, directories, glob patterns, and single file paths.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | Union[str, PurePath, List[PurePath]] | Input path(s). | required |

Returns:

| Type | Description |
| --- | --- |
| List[Path] | A list of resolved file paths. |
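
The four input shapes named above (lists, directories, glob patterns, single files) can be pictured with a small `pathlib` sketch. This is not the library's implementation: the non-recursive directory listing, the sorting, and the rule that only the last path component may contain wildcards are all assumptions made for the example.

```python
from pathlib import Path, PurePath
from typing import List, Union


def resolve_paths(path: Union[str, PurePath, List[PurePath]]) -> List[Path]:
    """Expand lists, directories, glob patterns, and single files into paths."""
    # A list is resolved item by item and flattened.
    if isinstance(path, (list, tuple)):
        resolved: List[Path] = []
        for item in path:
            resolved.extend(resolve_paths(item))
        return resolved

    candidate = Path(path)
    # A directory yields the regular files directly inside it (assumption:
    # no recursion into subdirectories).
    if candidate.is_dir():
        return sorted(p for p in candidate.iterdir() if p.is_file())
    # A wildcard in the last component is expanded against the parent directory.
    if any(ch in candidate.name for ch in "*?["):
        return sorted(p for p in candidate.parent.glob(candidate.name) if p.is_file())
    # Anything else is treated as a single file path.
    return [candidate]
```

Under these assumptions, `resolve_paths("docs/*.pdf")` would return the PDF files in `docs`, and `resolve_paths([Path("a.txt"), Path("docs")])` would return `a.txt` followed by the files in `docs`.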

basepdf

BasePDF
BasePDF(**kwargs)

Bases: AbstractLoader

Abstract base loader for all PDF file loaders.

load async
load(path)

Load data from a source and return it as a list of Langchain Documents.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | Union[str, PurePath, List[PurePath]] | The source of the data. | required |

Returns:

| Type | Description |
| --- | --- |
| List[Document] | A list of Langchain Documents. |

docx

MSWordLoader
MSWordLoader(tokenizer=None, text_splitter=None, summarizer=None, markdown_splitter=None, source_type='file', doctype='document', device=None, cuda_number=0, llm=None, **kwargs)

Bases: AbstractLoader

Loads Microsoft Word (.docx) files as Langchain Documents.

extract_text
extract_text(path)

Extract text from a docx file.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | Path | The source of the data. | required |

Returns:

| Type | Description |
| --- | --- |
| str | The extracted text. |
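
For orientation, a .docx file is a ZIP archive whose main body lives in `word/document.xml`, so text extraction can be sketched with the standard library alone. The function name `extract_docx_text` is hypothetical, and this is not a claim about how `MSWordLoader.extract_text` is implemented, which may rely on a dedicated library. The sketch ignores headers, footers, tables of contents, and formatting.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by elements such as <w:p> and <w:t>.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(path) -> str:
    """Return the paragraph text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(f"{W}p"):
        # A paragraph is split into runs; join the text of every <w:t> node.
        text = "".join(node.text or "" for node in para.iter(f"{W}t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)
```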

html

HTMLLoader
HTMLLoader(**kwargs)

Bases: AbstractLoader

Loader for HTML files to convert into Langchain Documents.

Processes HTML files, extracts relevant content, converts to Markdown, and associates metadata with each document.
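
The extraction step can be pictured with the `elements: - div: .product` selector from the YAML example: keep only the text inside `div` elements that carry the `product` class. The sketch below uses the standard-library `html.parser` and is an illustration only. The names `ClassTextExtractor` and `extract_blocks` are hypothetical, the input is assumed to be well-formed, and the Markdown conversion and metadata steps are left out.

```python
from html.parser import HTMLParser
from typing import List


class ClassTextExtractor(HTMLParser):
    """Collect the text of every <tag> element that has a given CSS class."""

    # Void elements never get a closing tag, so they must not affect depth.
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self, tag: str, css_class: str):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0  # nesting depth inside a matching element
        self.buffer: List[str] = []
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            self.depth += 1
        elif tag == self.tag and self.css_class in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in self.VOID or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            # Leaving the matching element: normalize whitespace and store.
            self.chunks.append(" ".join(" ".join(self.buffer).split()))
            self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def extract_blocks(html: str, tag: str, css_class: str) -> List[str]:
    parser = ClassTextExtractor(tag, css_class)
    parser.feed(html)
    parser.close()
    return parser.chunks
```

Each returned string would then be a candidate for one document's content, with the file path and `source_type` as natural metadata fields.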

pdfblocks

PDFBlocks
PDFBlocks(table_settings={}, **kwargs)

Bases: BasePDF

Loads tables from a PDF as blocks of text.

get_markdown
get_markdown(df)

Convert a DataFrame to a Markdown string.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | The DataFrame to convert. | required |

Returns:

| Type | Description |
| --- | --- |
| str | The Markdown string. |
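
To show the shape of the output, here is a dependency-free sketch that renders a header and rows as a Markdown pipe table. `rows_to_markdown` is a hypothetical helper, not the library's method; with pandas available, `DataFrame.to_markdown()` (which needs the `tabulate` package) produces a similar table, and whether `get_markdown` delegates to it is not stated here.

```python
from typing import List, Sequence


def rows_to_markdown(header: Sequence[object], rows: List[Sequence[object]]) -> str:
    """Render a header and rows as a Markdown pipe table."""

    def fmt(cells: Sequence[object]) -> str:
        # Escape pipes so cell text cannot break the table layout.
        return "| " + " | ".join(str(c).replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), "| " + " | ".join("---" for _ in header) + " |"]
    lines.extend(fmt(row) for row in rows)
    return "\n".join(lines)
```

For a DataFrame `df`, the same helper could be fed with `list(df.columns)` and `df.values.tolist()`.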

unique_columns
unique_columns(df)

Rename duplicate columns in the DataFrame to ensure they are unique.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | The DataFrame with potential duplicate column names. | required |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | A DataFrame with unique column names. |
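
One common way to make duplicate column names unique is to leave the first occurrence alone and add a numeric suffix to later ones. The sketch below works on a plain list of names so that it runs without pandas. The `_1`, `_2` suffix scheme and the name `unique_column_names` are assumptions for illustration; the library may use a different naming rule.

```python
from typing import Dict, List, Set


def unique_column_names(columns: List[str]) -> List[str]:
    """Suffix repeated names with _1, _2, ... so that every name is unique."""
    seen: Set[str] = set()
    counts: Dict[str, int] = {}
    result: List[str] = []
    for name in columns:
        new_name = name
        # Keep bumping the suffix until the candidate is unused; this also
        # avoids clashing with a column literally named "a_1".
        while new_name in seen:
            counts[name] = counts.get(name, 0) + 1
            new_name = f"{name}_{counts[name]}"
        seen.add(new_name)
        result.append(new_name)
    return result
```

With pandas, the result could be applied as `df.columns = unique_column_names(list(df.columns))`.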

pdfmark

PDFMarkdown
PDFMarkdown(**kwargs)

Bases: BasePDF

Loader for PDF files that converts their content to Markdown.

pdftables

PDFTables
PDFTables(table_settings={}, **kwargs)

Bases: BasePDF

Loader for tables found in PDF files.

get_markdown
get_markdown(df)

Convert a DataFrame to a Markdown string.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | The DataFrame to convert. | required |

Returns:

| Type | Description |
| --- | --- |
| str | The Markdown string. |

unique_columns
unique_columns(df)

Rename duplicate columns in the DataFrame to ensure they are unique.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | The DataFrame with potential duplicate column names. | required |

Returns:

| Type | Description |
| --- | --- |
| DataFrame | A DataFrame with unique column names. |

qa

QAFileLoader
QAFileLoader(columns=['Question', 'Answer'], **kwargs)

Bases: AbstractLoader

Loader for question-and-answer files in Excel format, converted to Langchain Documents.

load async
load(path)

Load data from a source and return it as a list of Langchain Documents.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| path | Path | The source of the data. | required |

Returns:

| Type | Description |
| --- | --- |
| List[Document] | A list of Langchain Documents. |
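
As a rough picture of what turning a question-and-answer sheet into documents can look like, the sketch below maps rows with `Question` and `Answer` columns (the defaults in the constructor signature) to one record per pair. Everything else is an assumption for illustration: the function name, the plain-dict document shape, the `Question: ... Answer: ...` text layout, and the rule that incomplete rows are skipped.

```python
from typing import Dict, List, Sequence


def qa_rows_to_documents(
    rows: List[Dict[str, object]],
    columns: Sequence[str] = ("Question", "Answer"),
) -> List[Dict[str, object]]:
    """Turn rows with question/answer columns into document-like dicts."""
    question_col, answer_col = columns
    documents = []
    for index, row in enumerate(rows):
        question = str(row.get(question_col) or "").strip()
        answer = str(row.get(answer_col) or "").strip()
        # Assumption: rows missing either side carry no usable pair.
        if not question or not answer:
            continue
        documents.append(
            {
                "page_content": f"Question: {question}\nAnswer: {answer}",
                "metadata": {"row": index, "question": question},
            }
        )
    return documents
```

With pandas, the rows could come from `pd.read_excel(path).to_dict("records")`.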

txt

TXTLoader
TXTLoader(tokenizer=None, text_splitter=None, summarizer=None, markdown_splitter=None, source_type='file', doctype='document', device=None, cuda_number=0, llm=None, **kwargs)

Bases: AbstractLoader

Loader for plain-text (TXT) files.