Loaders

flowtask.components.LangchainLoader.loaders

abstract

AbstractLoader

AbstractLoader(tokenizer=None, text_splitter=None, summarizer=None, markdown_splitter=None, source_type='file', doctype='document', device=None, cuda_number=0, llm=None, **kwargs)

Bases: ABC

Abstract class for Document loaders.

get_default_llm
get_default_llm()

Return a VertexLLM instance.

get_summary_from_text
get_summary_from_text(text, use_gpu=False)

Get a summary of a text.

load async
load(path)

Load data from a source and return it as a list of Langchain Documents.

Parameters:

Name Type Description Default
path Union[str, PurePath, List[PurePath]]

The source of the data.

required

Returns:

Type Description
List[Document]

List[Document]: A list of Langchain Documents.
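Because load is a coroutine, callers have to await it (or drive it with asyncio.run). The sketch below illustrates that calling pattern only. It is not flowtask code: Doc and PlainTextLoader are hypothetical stand-ins, and the only detail borrowed from Langchain is that a Document carries page_content and metadata.

```python
import asyncio
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List, Union


@dataclass
class Doc:
    """Hypothetical stand-in for a Langchain Document."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


class PlainTextLoader:
    """Hypothetical loader that mimics the async load(path) shape."""

    async def load(self, path: Union[str, Path]) -> List[Doc]:
        source = Path(path)
        # Read the file in a worker thread so the event loop is not blocked.
        text = await asyncio.to_thread(source.read_text, encoding="utf-8")
        # Return a list, matching the documented List[Document] return type.
        return [Doc(page_content=text, metadata={"source": str(source)})]
```

From synchronous code the call would look like asyncio.run(PlainTextLoader().load("notes.txt")), where notes.txt is a placeholder file name; inside another coroutine it is simply await loader.load(path).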

resolve_paths
resolve_paths(path)

Resolve the input path into a list of file paths. Handles lists, directories, glob patterns, and single file paths.

Parameters:

Name Type Description Default
path Union[str, PurePath, List[PurePath]]

Input path(s).

required

Returns:

Type Description
List[Path]

List[Path]: A list of resolved file paths.
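The four input shapes named above (list, directory, glob pattern, single file) can be handled roughly as in the following sketch. It uses only the standard library and is an assumption about the behaviour, not the flowtask implementation; the name resolve_paths_sketch, the non-recursive directory listing, and the empty result for a missing path are all choices made for illustration.

```python
import glob
from pathlib import Path, PurePath
from typing import List, Union

PathInput = Union[str, PurePath, List[PurePath]]


def resolve_paths_sketch(path: PathInput) -> List[Path]:
    """Expand lists, directories, glob patterns and single files into file paths."""
    # A list is resolved item by item and flattened.
    if isinstance(path, (list, tuple)):
        resolved: List[Path] = []
        for item in path:
            resolved.extend(resolve_paths_sketch(item))
        return resolved

    text = str(path)
    # Anything containing glob metacharacters is treated as a pattern.
    if any(ch in text for ch in "*?["):
        matches = (Path(m) for m in glob.glob(text, recursive=True))
        return sorted(m for m in matches if m.is_file())

    candidate = Path(text)
    if candidate.is_dir():
        # Assumption: only the files directly inside the directory are listed.
        return sorted(f for f in candidate.iterdir() if f.is_file())
    if candidate.is_file():
        return [candidate]
    # Assumption: a path that does not exist yields an empty list.
    return []
```

One consequence of this design is that a literal file name containing "[" or "*" is interpreted as a pattern; a real implementation may prefer to check for an existing file first.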

basepdf

BasePDF

BasePDF(**kwargs)

Bases: AbstractLoader

Abstract base loader for all PDF file loaders.

load async
load(path)

Load data from a source and return it as a list of Langchain Documents.

Parameters:

Name Type Description Default
path Union[str, PurePath, List[PurePath]]

The source of the data.

required

Returns:

Type Description
List[Document]

List[Document]: A list of Langchain Documents.

docx

MSWordLoader

MSWordLoader(tokenizer=None, text_splitter=None, summarizer=None, markdown_splitter=None, source_type='file', doctype='document', device=None, cuda_number=0, llm=None, **kwargs)

Bases: AbstractLoader

Load Microsoft Word (.docx) files as Langchain Documents.

extract_text
extract_text(path)

Extract text from a docx file.

Parameters:

Name Type Description Default
path Path

The source of the data.

required

Returns:

Name Type Description
str

The extracted text.

html

HTMLLoader

HTMLLoader(**kwargs)

Bases: AbstractLoader

Loader for HTML files to convert into Langchain Documents.

Processes HTML files, extracts relevant content, converts to Markdown, and associates metadata with each document.

pdfblocks

PDFBlocks

PDFBlocks(table_settings={}, **kwargs)

Bases: BasePDF

Load a PDF Table as Blocks of text.

get_markdown
get_markdown(df)

Convert a DataFrame to a Markdown string.

Parameters:

Name Type Description Default
df DataFrame

The DataFrame to convert.

required

Returns:

Name Type Description
str str

The Markdown string.
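To make the table-to-Markdown conversion concrete, here is a minimal standard-library sketch that renders column names and rows as a Markdown pipe table. It is a stand-in under stated assumptions: table_to_markdown is a hypothetical helper that takes plain sequences rather than a DataFrame, and nothing here is confirmed to match how get_markdown formats its output.

```python
from typing import Iterable, List, Sequence


def table_to_markdown(columns: Sequence[str], rows: Iterable[Sequence[object]]) -> str:
    """Render column names and rows as a Markdown pipe table."""

    def cell(value: object) -> str:
        # Escape pipes and flatten newlines so a value cannot break the layout.
        return str(value).replace("|", "\\|").replace("\n", " ")

    lines: List[str] = [
        "| " + " | ".join(cell(c) for c in columns) + " |",
        "| " + " | ".join("---" for _ in columns) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(cell(v) for v in row) + " |")
    return "\n".join(lines)
```

With pandas available, DataFrame.to_markdown(index=False) produces a comparable table directly; it relies on the optional tabulate package.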

unique_columns
unique_columns(df)

Rename duplicate columns in the DataFrame to ensure they are unique.

Parameters:

Name Type Description Default
df DataFrame

The DataFrame with potential duplicate column names.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame with unique column names.
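Tables extracted from PDFs often repeat a header, which is why the column names need to be made unique before further processing. The sketch below shows one way to do that on a plain list of names. The helper name dedupe_column_names and the numeric suffix scheme (_1, _2, ...) are assumptions for illustration; the suffix format used by unique_columns is not stated in this reference.

```python
from typing import Dict, List, Sequence, Set


def dedupe_column_names(names: Sequence[str]) -> List[str]:
    """Return the names in order, appending _1, _2, ... to repeated ones."""
    taken: Set[str] = set()
    counts: Dict[str, int] = {}
    result: List[str] = []
    for name in names:
        candidate = name
        # Keep bumping the suffix until the candidate collides with nothing,
        # including names that already look like "name_1".
        while candidate in taken:
            counts[name] = counts.get(name, 0) + 1
            candidate = f"{name}_{counts[name]}"
        taken.add(candidate)
        result.append(candidate)
    return result
```

On a DataFrame the same idea would be applied as df.columns = dedupe_column_names(list(df.columns)), leaving the data itself untouched.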

pdfmark

PDFMarkdown

PDFMarkdown(**kwargs)

Bases: BasePDF

Loader for PDF files that converts their content to Markdown.

pdftables

PDFTables

PDFTables(table_settings={}, **kwargs)

Bases: BasePDF

Loader for tables present in PDF files.

get_markdown
get_markdown(df)

Convert a DataFrame to a Markdown string.

Parameters:

Name Type Description Default
df DataFrame

The DataFrame to convert.

required

Returns:

Name Type Description
str str

The Markdown string.

unique_columns
unique_columns(df)

Rename duplicate columns in the DataFrame to ensure they are unique.

Parameters:

Name Type Description Default
df DataFrame

The DataFrame with potential duplicate column names.

required

Returns:

Type Description
DataFrame

pd.DataFrame: A DataFrame with unique column names.

qa

QAFileLoader

QAFileLoader(columns=['Question', 'Answer'], **kwargs)

Bases: AbstractLoader

Question and Answers file based on Excel, converted to Langchain Documents.

load async
load(path)

Load data from a source and return it as a list of Langchain Documents.

Parameters:

Name Type Description Default
path Path

The source of the data.

required

Returns:

Type Description
List[Document]

List[Document]: A list of Langchain Documents.
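The following sketch illustrates how rows with the default Question and Answer columns might become one document each. It is a guess at the general shape, not the QAFileLoader implementation: Doc and rows_to_documents are hypothetical names, the rows are plain dictionaries instead of an Excel sheet, and the "Question: ... Answer: ..." text layout and the skipping of incomplete rows are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, List, Mapping, Sequence


@dataclass
class Doc:
    """Hypothetical stand-in for a Langchain Document."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def rows_to_documents(
    rows: Iterable[Mapping[str, Any]],
    columns: Sequence[str] = ("Question", "Answer"),
) -> List[Doc]:
    """Build one document per row that has both a question and an answer."""
    question_col, answer_col = columns
    docs: List[Doc] = []
    for index, row in enumerate(rows):
        question = str(row.get(question_col) or "").strip()
        answer = str(row.get(answer_col) or "").strip()
        # Assumption: rows missing either side are skipped, not raised on.
        if not question or not answer:
            continue
        docs.append(
            Doc(
                page_content=f"Question: {question}\nAnswer: {answer}",
                metadata={"row": index, "question": question},
            )
        )
    return docs
```

Keeping the question in the metadata as well as in the text is a design choice that lets a retriever filter or display by question without parsing page_content.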

txt

TXTLoader

TXTLoader(tokenizer=None, text_splitter=None, summarizer=None, markdown_splitter=None, source_type='file', doctype='document', device=None, cuda_number=0, llm=None, **kwargs)

Bases: AbstractLoader

Loader for plain-text (TXT) files.