Initial.
BIN
file_processor/dep_pdf_extractor_mod/.hex
Normal file
Binary file not shown.
79
file_processor/dep_pdf_extractor_mod/CHANGELOG.md
Normal file
@@ -0,0 +1,79 @@
# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

## [0.5.0] - 2025-08-23

- Simplify arguments to avoid repetition
- Upgrade Python to v3.12 and pdfplumber to v0.11.7
- Lower Elixir requirement to v1.15

## [0.4.1] - 2025-07-30

- Fix `PdfExtractor.start_link/1` call to link the process correctly to the supervisor tree

## [0.4.0] - 2025-07-21

### Changed

- Made PdfExtractor a single process to avoid issues with the Python GIL

## [0.3.0] - 2025-07-20

### Added

- **Multiple Areas Support**: Extract text from multiple bounding box areas on the same page
- **Metadata Extraction**: New `extract_metadata/1` and `extract_metadata_from_binary/1` functions
- **Binary PDF Processing**: Extract text and metadata directly from PDF binary data
- **Enhanced Documentation**: Comprehensive doctests and improved API documentation
- **Improved Test Coverage**: Added extensive test suite for new functionality

### Changed

- Enhanced area-based extraction to support lists of areas per page
- Improved error handling and edge case management
- Updated type specifications for better developer experience

### Fixed

- Better handling of invalid page numbers and area coordinates
- Improved Python environment initialization

## [0.2.1] - 2025-06-27

### Fixed

- Added automatic Python dependencies download and installation
- Improved application startup process

## [0.2.0] - 2025-06-22

### Added

- Project badges and improved README documentation
- Enhanced configuration and documentation setup

### Changed

- Improved function naming and API consistency
- Better documentation structure

## [0.1.0] - 2025-06-21

### Added

- Initial release of PdfExtractor
- Support for extracting text from PDF files using Python's pdfplumber
- Single page text extraction
- Multi-page text extraction
- Basic area-based text extraction with bounding boxes
- Initial test suite
- Basic documentation and examples

### Dependencies

- pythonx ~> 0.4.0 for Python integration
- Requires Python with pdfplumber package installed

[Unreleased]: https://github.com/nelsonmestevao/pdf_extractor/compare/v0.3.0...HEAD
[0.3.0]: https://github.com/nelsonmestevao/pdf_extractor/compare/v0.2.1...v0.3.0
[0.2.1]: https://github.com/nelsonmestevao/pdf_extractor/compare/v0.2.0...v0.2.1
[0.2.0]: https://github.com/nelsonmestevao/pdf_extractor/compare/v0.1.0...v0.2.0
[0.1.0]: https://github.com/nelsonmestevao/pdf_extractor/releases/tag/v0.1.0
21
file_processor/dep_pdf_extractor_mod/LICENSE
Normal file
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 Nelson Estevão

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
91
file_processor/dep_pdf_extractor_mod/README.md
Normal file
@@ -0,0 +1,91 @@
# PdfExtractor

[](https://hex.pm/packages/pdf_extractor)
[](https://hexdocs.pm/pdf_extractor)
[](https://hex.pm/packages/pdf_extractor)
[](https://hex.pm/packages/pdf_extractor)
[](https://github.com/nelsonmestevao/pdf_extractor)


A powerful and easy-to-use Elixir library for extracting text and metadata from PDF files.

PdfExtractor leverages Python's `pdfplumber` library through seamless integration to provide
robust PDF text extraction capabilities. It supports both file-based and binary-based operations,
making it suitable for various use cases from local file processing to web-based PDF handling.

## Features

- 🔍 Extract text from single or multiple PDF pages
- 📍 Area-based extraction using bounding boxes
- 🌐 Work with PDF data directly from memory (e.g., HTTP downloads)
- 📊 Get PDF metadata like title, author, creation date
- 🐍 Leverages Python's powerful `pdfplumber` library
- 🚀 Simple and intuitive API
- ✅ Comprehensive test coverage
- 📚 Full documentation

## Installation

Add `pdf_extractor` to your list of dependencies in `mix.exs`:

```elixir
def deps do
  [
    {:pdf_extractor, "~> 0.5.0"}
  ]
end
```

Then add it to your application's supervision tree:

```elixir
defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    children = [
      # ...your other children
      PdfExtractor
    ]

    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
    Supervisor.start_link(children, opts)
  end
end
```

## Usage

Extract text from specific regions using bounding boxes `{x0, y0, x1, y1}`:

```elixir
areas = %{
  0 => {0, 0, 300, 200}, # Top-left area of page 0
  1 => [
    {200, 300, 600, 500}, # Bottom-right area of page 1
    {0, 0, 200, 250} # Top-left area of page 1
  ]
}
PdfExtractor.extract_text("path/to/document.pdf", areas)
```
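
Because the areas map is passed to Python as-is, it can help to catch obviously malformed boxes before making the call. The following is only a sketch: `AreaCheck` is a hypothetical helper that is not part of PdfExtractor, and it assumes each box is `{x0, y0, x1, y1}` with non-negative numbers, `x1 > x0` and `y1 > y0`.

```elixir
defmodule AreaCheck do
  # Hypothetical helper for illustration; not part of PdfExtractor.

  # Returns :ok when every page key is a non-negative integer and every
  # bounding box is well formed, otherwise {:error, {:invalid_area, page}}.
  def validate(areas) when is_map(areas) do
    Enum.reduce_while(areas, :ok, fn {page, boxes}, :ok ->
      # List.wrap/1 lets a single tuple and a list of tuples share one code path.
      if page_ok?(page) and Enum.all?(List.wrap(boxes), &box_ok?/1) do
        {:cont, :ok}
      else
        {:halt, {:error, {:invalid_area, page}}}
      end
    end)
  end

  defp page_ok?(page), do: is_integer(page) and page >= 0

  defp box_ok?({x0, y0, x1, y1})
       when is_number(x0) and is_number(y0) and is_number(x1) and is_number(y1) and
              x0 >= 0 and y0 >= 0 and x1 > x0 and y1 > y0,
       do: true

  defp box_ok?(_other), do: false
end
```

With the `areas` map from the example above, `AreaCheck.validate(areas)` returns `:ok`, while a box such as `{300, 0, 0, 200}` on page 0 yields `{:error, {:invalid_area, 0}}`.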

### Return Format

The function returns `{:ok, result}`, where `result` is a map whose keys are page numbers and whose values are the extracted text (a list of strings when several areas were requested for a page). If the Python side raises, it returns `{:error, exception}` instead:

```elixir
{:ok,
 %{
   0 => "Text from page 0...",
   1 => ["Text from page 1 (first area)...", "Text from page 1 (second area)..."],
   2 => "Text from page 2..."
 }}
```
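
Since a page value can be either a string or a list of strings, callers usually need to normalize the result. The sketch below assumes the `{:ok, map}` / `{:error, exception}` replies produced by `handle_call/3` in `lib/pdf_extractor.ex`; `ResultHelper` is a hypothetical module, not part of the library.

```elixir
defmodule ResultHelper do
  # Hypothetical helper for illustration; not part of PdfExtractor.

  # Turns {:ok, %{page => text | [text]}} into {:ok, [{page, text}]},
  # ordered by page number, with one tuple per extracted area.
  def flatten({:ok, pages}) when is_map(pages) do
    pairs =
      pages
      |> Enum.sort_by(fn {page, _text} -> page end)
      |> Enum.flat_map(fn {page, text} ->
        # A single string becomes a one-element list; lists pass through.
        text |> List.wrap() |> Enum.map(&{page, &1})
      end)

    {:ok, pairs}
  end

  # Errors are passed through untouched so callers can match on them.
  def flatten({:error, reason}), do: {:error, reason}
end
```

For example, `ResultHelper.flatten({:ok, %{1 => ["a", "b"], 0 => "x"}})` returns `{:ok, [{0, "x"}, {1, "a"}, {1, "b"}]}`.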

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## Acknowledgments

- Built on top of the excellent [pdfplumber](https://github.com/jsvine/pdfplumber) Python library
- Uses [pythonx](https://github.com/livebook-dev/pythonx) for seamless Python integration
22
file_processor/dep_pdf_extractor_mod/hex_metadata.config
Normal file
@@ -0,0 +1,22 @@
{<<"links">>,
 [{<<"Changelog">>,
   <<"https://github.com/nelsonmestevao/pdf_extractor/blob/main/CHANGELOG.md">>},
  {<<"GitHub">>,<<"https://github.com/nelsonmestevao/pdf_extractor">>}]}.
{<<"name">>,<<"pdf_extractor">>}.
{<<"version">>,<<"0.5.0">>}.
{<<"description">>,
 <<"A lightweight Elixir library for extracting text from PDF files using Python's pdfplumber.\nSupports single and multi-page extraction with optional area filtering.">>}.
{<<"elixir">>,<<"~> 1.15">>}.
{<<"app">>,<<"pdf_extractor">>}.
{<<"files">>,
 [<<"lib">>,<<"lib/pdf_extractor">>,<<"lib/pdf_extractor/pdf_plumber.ex">>,
  <<"lib/pdf_extractor.ex">>,<<"mix.exs">>,<<"README.md">>,<<"LICENSE">>,
  <<"CHANGELOG.md">>]}.
{<<"licenses">>,[<<"MIT">>]}.
{<<"requirements">>,
 [[{<<"name">>,<<"pythonx">>},
   {<<"app">>,<<"pythonx">>},
   {<<"optional">>,false},
   {<<"requirement">>,<<"~> 0.4.4">>},
   {<<"repository">>,<<"hexpm">>}]]}.
{<<"build_tools">>,[<<"mix">>]}.
228
file_processor/dep_pdf_extractor_mod/lib/pdf_extractor.ex
Normal file
@@ -0,0 +1,228 @@
defmodule PdfExtractor do
  @moduledoc "README.md"
             |> File.read!()
             |> String.split("\n\n")
             |> tl()
             |> tl()
             |> Enum.join("\n\n")
  use GenServer

  @external_resource "README.md"

  # Client

  def start_link(opts \\ []) do
    opts = Keyword.validate!(opts, name: __MODULE__)
    GenServer.start_link(__MODULE__, [], name: opts[:name])
  end

  @doc ~S"""
  Extracts text from PDF pages.

  It supports extracting from single pages, multiple pages, and specific areas within pages.

  ## Page Numbers

  - **Integer**: Extract from single page (e.g., `0` for first page)
  - **List**: Extract from multiple pages (e.g., `[0, 1, 2]`)
  - **Empty list** `[]`: Extract from all pages (default)

  ## Areas Format

  Areas are specified as a map where keys are page numbers and values are bounding boxes:

  - **Single area**: `%{0 => {x0, y0, x1, y1}}`
  - **Multiple areas**: `%{0 => [{x0, y0, x1, y1}, {x2, y2, x3, y3}]}`
  - **Mixed**: `%{0 => {x0, y0, x1, y1}, 1 => [{x2, y2, x3, y3}, {x4, y4, x5, y5}]}`

  ## Examples

  Extract text from all pages.

      iex> PdfExtractor.extract_text("priv/fixtures/fatura.pdf")
      {:ok,
       %{
         0 =>
           "Text Example Bill FATURA\n# 2025010002\nData: Jun 21, 2025\nProjeto de lei para:\nSaldo devedor: 1 525,59 €\nElixir Company\nItem Quantidade Avaliar Quantia\nTrabalho 1 1 500,00 € 1 500,00 €\nMais trabalho 1 25,59 € 25,59 €\nSubtotal: 1 525,59 €\nImposto (0%): 0,00 €\nTotal: 1 525,59 €",
         1 =>
           "✂\nReceipt Payment part Account / Payable to\nCH4431999123000889012\n✂\nMax Muster & Söhne\nAccount / Payable to\nCH4431999123000889012 Musterstrasse 123\nMax Muster & Söhne 8000 Seldwyla\nMusterstrasse 123\n8000 Seldwyla\nReference\n210000000003139471430009017\nReference\n210000000003139471430009017\nAdditional information\nBestellung vom 15.10.2020\nPayable by (name/address)\nSimon Muster\nPayable by (name/address)\nMusterstrasse 1\nCurrency Amount\nSimon Muster\n8000 Seldwyla\nCHF 1 949.75 Musterstrasse 1\n8000 Seldwyla\nCurrency Amount\nCHF 1 949.75\nAcceptance point"
       }}

  Extract text from only some pages.

      iex> PdfExtractor.extract_text("priv/fixtures/fatura.pdf", [0])
      {:ok,
       %{
         0 =>
           "Text Example Bill FATURA\n# 2025010002\nData: Jun 21, 2025\nProjeto de lei para:\nSaldo devedor: 1 525,59 €\nElixir Company\nItem Quantidade Avaliar Quantia\nTrabalho 1 1 500,00 € 1 500,00 €\nMais trabalho 1 25,59 € 25,59 €\nSubtotal: 1 525,59 €\nImposto (0%): 0,00 €\nTotal: 1 525,59 €"
       }}

  Extract only the titles in the book chapters.

      iex> PdfExtractor.extract_text("priv/fixtures/book.pdf", %{
      ...>   2 => {0, 0, 612, 190},
      ...>   8 => {0, 0, 612, 190},
      ...>   10 => {0, 0, 612, 190}
      ...> })
      {:ok,
       %{
         2 => "Introdução – Nota do tradutor",
         8 => "I. Sobre aproveitar o tempo",
         10 => "II. Sobre a falta de foco na Leitura"
       }}

  Extract multiple areas from a single page.

      iex> PdfExtractor.extract_text("priv/fixtures/book.pdf", %{
      ...>   1 => [{0, 100, 612, 140}, {0, 400, 612, 440}]
      ...> })
      {:ok,
       %{
         1 => [
           "CARTAS DE UM ESTOICO, Volume I",
           "Montecristo Editora Ltda.\ne-mail: editora@montecristoeditora.com.br"
         ]
       }}
  """
  def extract_text(file_path, pages \\ []) do
    GenServer.call(__MODULE__, {:extract_text, [file_path, pages]})
  end

  # Same as extract_text/2, but with an explicit GenServer.call/3 timeout
  # (milliseconds or :infinity) instead of the 5000 ms default.
  def extract_text_timeout(file_path, pages \\ [], timeout) do
    GenServer.call(__MODULE__, {:extract_text, [file_path, pages]}, timeout)
  end

  @doc ~S"""
  Extracts text from PDF binary data. See `extract_text/2` for details on how to specify pages and areas.

  This function allows you to extract text from PDF data that's already in memory,
  such as data downloaded from a URL or received via an API. This avoids the need
  to write the PDF to the filesystem.

  ## Examples

  Extract text from all pages.

      iex> content = File.read!("priv/fixtures/fatura.pdf")
      ...> PdfExtractor.extract_text_from_binary(content)
      {:ok,
       %{
         0 =>
           "Text Example Bill FATURA\n# 2025010002\nData: Jun 21, 2025\nProjeto de lei para:\nSaldo devedor: 1 525,59 €\nElixir Company\nItem Quantidade Avaliar Quantia\nTrabalho 1 1 500,00 € 1 500,00 €\nMais trabalho 1 25,59 € 25,59 €\nSubtotal: 1 525,59 €\nImposto (0%): 0,00 €\nTotal: 1 525,59 €",
         1 =>
           "✂\nReceipt Payment part Account / Payable to\nCH4431999123000889012\n✂\nMax Muster & Söhne\nAccount / Payable to\nCH4431999123000889012 Musterstrasse 123\nMax Muster & Söhne 8000 Seldwyla\nMusterstrasse 123\n8000 Seldwyla\nReference\n210000000003139471430009017\nReference\n210000000003139471430009017\nAdditional information\nBestellung vom 15.10.2020\nPayable by (name/address)\nSimon Muster\nPayable by (name/address)\nMusterstrasse 1\nCurrency Amount\nSimon Muster\n8000 Seldwyla\nCHF 1 949.75 Musterstrasse 1\n8000 Seldwyla\nCurrency Amount\nCHF 1 949.75\nAcceptance point"
       }}

  Extract text from only some pages.

      iex> content = File.read!("priv/fixtures/fatura.pdf")
      ...> PdfExtractor.extract_text_from_binary(content, [0])
      {:ok,
       %{
         0 =>
           "Text Example Bill FATURA\n# 2025010002\nData: Jun 21, 2025\nProjeto de lei para:\nSaldo devedor: 1 525,59 €\nElixir Company\nItem Quantidade Avaliar Quantia\nTrabalho 1 1 500,00 € 1 500,00 €\nMais trabalho 1 25,59 € 25,59 €\nSubtotal: 1 525,59 €\nImposto (0%): 0,00 €\nTotal: 1 525,59 €"
       }}

  Extract only the titles in the book chapters.

      iex> content = File.read!("priv/fixtures/book.pdf")
      ...>
      ...> PdfExtractor.extract_text_from_binary(content, %{
      ...>   2 => {0, 0, 612, 190},
      ...>   8 => {0, 0, 612, 190},
      ...>   10 => {0, 0, 612, 190}
      ...> })
      {:ok,
       %{
         2 => "Introdução – Nota do tradutor",
         8 => "I. Sobre aproveitar o tempo",
         10 => "II. Sobre a falta de foco na Leitura"
       }}

  Extract multiple areas from a single page.

      iex> content = File.read!("priv/fixtures/book.pdf")
      ...>
      ...> PdfExtractor.extract_text_from_binary(content, %{
      ...>   1 => [{0, 100, 612, 140}, {0, 400, 612, 440}]
      ...> })
      {:ok,
       %{
         1 => [
           "CARTAS DE UM ESTOICO, Volume I",
           "Montecristo Editora Ltda.\ne-mail: editora@montecristoeditora.com.br"
         ]
       }}

  """
  def extract_text_from_binary(binary, pages \\ []) do
    GenServer.call(__MODULE__, {:extract_text_from_binary, [binary, pages]})
  end

  @doc """
  Extracts metadata from a PDF file's info trailer. Typically includes "CreationDate", "ModDate", "Producer", et cetera.

  ## Examples

      iex> PdfExtractor.extract_metadata("priv/fixtures/book.pdf")
      {:ok,
       %{
         "CreationDate" => "D:20250718212328Z",
         "Creator" => "Stirling-PDF v0.44.2",
         "ModDate" => "D:20250718212328Z",
         "Producer" => "Stirling-PDF v0.44.2"
       }}

  """
  def extract_metadata(file_path) do
    GenServer.call(__MODULE__, {:extract_metadata, [file_path]})
  end

  @doc """
  Extracts metadata from PDF binary data. Similar to `extract_metadata/1` but works with PDF data in memory instead of
  files.

  ## Examples

      iex> content = File.read!("priv/fixtures/book.pdf")
      ...> PdfExtractor.extract_metadata_from_binary(content)
      {:ok,
       %{
         "CreationDate" => "D:20250718212328Z",
         "Creator" => "Stirling-PDF v0.44.2",
         "ModDate" => "D:20250718212328Z",
         "Producer" => "Stirling-PDF v0.44.2"
       }}

  """
  def extract_metadata_from_binary(binary) do
    GenServer.call(__MODULE__, {:extract_metadata_from_binary, [binary]})
  end

  # Server

  @doc false
  @impl true
  def init([] = state) do
    try do
      :ok = PdfExtractor.PdfPlumber.start()
    rescue
      e in RuntimeError ->
        if e.message =~ ~r/Python interpreter has already been initialized/ do
          :ok
        else
          reraise e, __STACKTRACE__
        end
    end

    {:ok, state}
  end

  @doc false
  @impl true
  def handle_call({function, args}, _from, state) when is_atom(function) and is_list(args) do
    {:reply, {:ok, apply(PdfExtractor.PdfPlumber, function, args)}, state}
  rescue
    exception in Pythonx.Error -> {:reply, {:error, exception}, state}
  end
end
@@ -0,0 +1,189 @@
defmodule PdfExtractor.PdfPlumber do
  @moduledoc false

  def start do
    Pythonx.uv_init("""
    [project]
    name = "pdf_extractor"
    version = "#{to_string(version())}"
    requires-python = "==3.12.*"
    dependencies = [
      "pdfplumber==0.11.7"
    ]
    """)
  end

  @type area :: {non_neg_integer(), non_neg_integer(), non_neg_integer(), non_neg_integer()}
  @type page :: non_neg_integer()

  @spec extract_text(
          file_path :: String.t(),
          pages :: page() | list(page()) | %{page() => area() | [area()] | nil}
        ) :: %{page() => String.t() | list(String.t())}
  def extract_text(file_path, page_number) when is_integer(page_number) do
    extract_text(file_path, List.wrap(page_number))
  end

  def extract_text(file_path, pages) when is_list(pages) do
    """
    #{python_extract_code()}

    main(file_path.decode('utf-8'), page_numbers, areas)
    """
    |> Pythonx.eval(%{
      "file_path" => file_path,
      "page_numbers" => pages,
      "areas" => %{}
    })
    |> elem(0)
    |> Pythonx.decode()
    |> to_map(pages)
  end

  def extract_text(file_path, pages) when is_map(pages) do
    """
    #{python_extract_code()}

    main(file_path.decode('utf-8'), page_numbers, areas)
    """
    |> Pythonx.eval(%{
      "file_path" => file_path,
      "page_numbers" => Map.keys(pages),
      "areas" => pages
    })
    |> elem(0)
    |> Pythonx.decode()
    |> to_map(Map.keys(pages))
  end

  @doc """
  This version avoids the need to put the PDF on a filesystem.
  It makes the following work:

      url = "https://erlang.org/download/armstrong_thesis_2003.pdf"
      url |> :httpc.request() |> elem(1) |> elem(2) |> :binary.list_to_bin() |> PdfExtractor.extract_text_from_binary()
  """
  def extract_text_from_binary(binary, page_number) when is_integer(page_number) do
    extract_text_from_binary(binary, List.wrap(page_number))
  end

  def extract_text_from_binary(binary, pages) when is_list(pages) do
    """
    from io import BytesIO

    #{python_extract_code()}

    main(BytesIO(binary), page_numbers, areas)
    """
    |> Pythonx.eval(%{
      "binary" => binary,
      "page_numbers" => pages,
      "areas" => %{}
    })
    |> elem(0)
    |> Pythonx.decode()
    |> to_map(pages)
  end

  def extract_text_from_binary(binary, pages) when is_map(pages) do
    """
    from io import BytesIO

    #{python_extract_code()}

    main(BytesIO(binary), page_numbers, areas)
    """
    |> Pythonx.eval(%{
      "binary" => binary,
      "page_numbers" => Map.keys(pages),
      "areas" => pages
    })
    |> elem(0)
    |> Pythonx.decode()
    |> to_map(Map.keys(pages))
  end

  defp python_extract_code do
    """
    import pdfplumber
    import logging

    logging.getLogger("pdfminer").setLevel(logging.ERROR)

    def extract_from_page(page, areas=None):
        if areas is None:
            return page.extract_text()
        elif isinstance(areas, list):
            return [page.within_bbox(area).extract_text() for area in areas]
        else:
            return page.within_bbox(areas).extract_text()

    def main(content, page_numbers, areas):
        results = []
        with pdfplumber.open(content) as pdf:
            total_pages = len(pdf.pages)
            if page_numbers == []:
                page_numbers = list(range(total_pages))
            for page_number in page_numbers:
                if page_number >= 0 and page_number < total_pages:
                    results.append(extract_from_page(pdf.pages[page_number], areas.get(page_number)))
        return results
    """
  end

  def extract_metadata(file_path) do
    """
    #{python_extract_metadata_code()}

    main(file_path.decode('utf-8'))
    """
    |> Pythonx.eval(%{
      "file_path" => file_path
    })
    |> elem(0)
    |> Pythonx.decode()
  end

  def extract_metadata_from_binary(binary) do
    """
    from io import BytesIO

    #{python_extract_metadata_code()}

    main(BytesIO(binary))
    """
    |> Pythonx.eval(%{
      "binary" => binary
    })
    |> elem(0)
    |> Pythonx.decode()
  end

  defp python_extract_metadata_code do
    """
    import pdfplumber
    import logging

    logging.getLogger("pdfminer").setLevel(logging.ERROR)

    def main(content):
        with pdfplumber.open(content) as pdf:
            return pdf.metadata
    """
  end

  defp to_map(texts, []) when is_list(texts) do
    texts
    |> Enum.with_index(&{&2, &1})
    |> Map.new()
  end

  defp to_map(texts, page_numbers) when is_list(texts) do
    page_numbers
    |> Enum.zip(texts)
    |> Map.new()
  end

  defp version do
    Application.spec(:pdf_extractor, :vsn)
  end
end
90
file_processor/dep_pdf_extractor_mod/mix.exs
Normal file
@@ -0,0 +1,90 @@
defmodule PdfExtractor.MixProject do
  use Mix.Project

  @app :pdf_extractor
  @name "PdfExtractor"
  @version "0.5.0"
  @source_url "https://github.com/nelsonmestevao/pdf_extractor"

  def project do
    [
      name: @name,
      app: @app,
      version: @version,
      elixir: "~> 1.15",
      start_permanent: Mix.env() == :prod,
      deps: deps(),
      description: description(),
      package: package(),
      docs: docs(),
      aliases: aliases(),
      dialyzer: dialyzer(),
      source_url: @source_url
    ]
  end

  def application do
    [
      extra_applications: [:logger]
    ]
  end

  defp deps do
    [
      {:pythonx, "~> 0.4.4"},

      # tools
      {:credo, "~> 1.7", only: [:dev, :test], runtime: false},
      {:dialyxir, "~> 1.4", only: [:dev, :test], runtime: false},
      {:doctest_formatter, "~> 0.4.0", only: [:dev, :test], runtime: false},
      {:ex_doc, "~> 0.38", only: :dev, runtime: false},
      {:styler, "~> 1.0", only: [:dev, :test], runtime: false}
    ]
  end

  defp aliases do
    [
      "lint.dialyzer": ["dialyzer --format dialyxir"]
    ]
  end

  defp description do
    """
    A lightweight Elixir library for extracting text from PDF files using Python's pdfplumber.
    Supports single and multi-page extraction with optional area filtering.
    """
  end

  defp package do
    [
      name: @app,
      files: ~w(lib mix.exs README.md LICENSE CHANGELOG*),
      licenses: ["MIT"],
      links: %{
        "GitHub" => @source_url,
        "Changelog" => "#{@source_url}/blob/main/CHANGELOG.md"
      },
      maintainers: ["Nelson Estevão <nelsonmestevao@proton.me>"]
    ]
  end

  defp docs do
    [
      main: "readme",
      name: @name,
      source_ref: "v#{@version}",
      source_url: @source_url,
      extras: ["README.md", "CHANGELOG.md", "LICENSE"]
    ]
  end

  defp dialyzer do
    [
      flags: [:no_opaque],
      list_unused_filters: true,
      plt_add_deps: :apps_tree,
      plt_add_apps: [:ex_unit, :iex, :mix, :credo_naming],
      plt_file: {:no_warn, "priv/plts/elixir-#{System.version()}-erlang-otp-#{System.otp_release()}.plt"}
    ]
  end
end