Initial.

2025-12-05 16:57:52 -05:00
commit 2f3d42cf5f
23 changed files with 1264 additions and 0 deletions
--- a/file_processor/dep_pdf_extractor_mod/.hex
+++ b/file_processor/dep_pdf_extractor_mod/.hex
--- a/file_processor/dep_pdf_extractor_mod/CHANGELOG.md
+++ b/file_processor/dep_pdf_extractor_mod/CHANGELOG.md
@@ -0,0 +1,79 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [Unreleased]
+
+## [0.5.0] - 2025-08-23
+
+- Simplify arguments to avoid repetition
+- Upgrade Python to v3.12 and pdfplumber to v0.11.7
+- Lower Elixir requirement to v1.15
+
+## [0.4.1] - 2025-07-30
+
+- Fix `PdfExtractor.start_link/1` call to link the process correctly to the supervisor tree
+
+## [0.4.0] - 2025-07-21
+
+### Changed
+- Made PdfExtractor a single process to avoid issues with the Python GIL
+
+## [0.3.0] - 2025-07-20
+
+### Added
+- **Multiple Areas Support**: Extract text from multiple bounding box areas on the same page
+- **Metadata Extraction**: New `extract_metadata/1` and `extract_metadata_from_binary/1` functions
+- **Binary PDF Processing**: Extract text and metadata directly from PDF binary data
+- **Enhanced Documentation**: Comprehensive doctests and improved API documentation
+- **Improved Test Coverage**: Added extensive test suite for new functionality
+
+### Changed
+- Enhanced area-based extraction to support lists of areas per page
+- Improved error handling and edge case management
+- Updated type specifications for better developer experience
+
+### Fixed
+- Better handling of invalid page numbers and area coordinates
+- Improved Python environment initialization
+
+## [0.2.1] - 2025-06-27
+
+### Fixed
+- Added automatic Python dependencies download and installation
+- Improved application startup process
+
+## [0.2.0] - 2025-06-22
+
+### Added
+- Project badges and improved README documentation
+- Enhanced configuration and documentation setup
+
+### Changed
+
+- Improved function naming and API consistency
+- Better documentation structure
+
+## [0.1.0] - 2025-06-21
+
+### Added
+- Initial release of PdfExtractor
+- Support for extracting text from PDF files using Python's pdfplumber
+- Single page text extraction
+- Multi-page text extraction
+- Basic area-based text extraction with bounding boxes
+- Initial test suite
+- Basic documentation and examples
+
+### Dependencies
+- pythonx ~> 0.4.0 for Python integration
+- Requires Python with pdfplumber package installed
+
+[Unreleased]: https://github.com/nelsonmestevao/pdf_extractor/compare/v0.3.0...HEAD
+[0.3.0]: https://github.com/nelsonmestevao/pdf_extractor/compare/v0.2.1...v0.3.0
+[0.2.1]: https://github.com/nelsonmestevao/pdf_extractor/compare/v0.2.0...v0.2.1
+[0.2.0]: https://github.com/nelsonmestevao/pdf_extractor/compare/v0.1.0...v0.2.0
+[0.1.0]: https://github.com/nelsonmestevao/pdf_extractor/releases/tag/v0.1.0
--- a/file_processor/dep_pdf_extractor_mod/LICENSE
+++ b/file_processor/dep_pdf_extractor_mod/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2025 Nelson Estevão
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
--- a/file_processor/dep_pdf_extractor_mod/README.md
+++ b/file_processor/dep_pdf_extractor_mod/README.md
@@ -0,0 +1,91 @@
+# PdfExtractor
+
+[![Release](https://img.shields.io/hexpm/v/pdf_extractor.svg)](https://hex.pm/packages/pdf_extractor)
+[![Documentation](https://img.shields.io/badge/docs-hexpm-blue.svg)](https://hexdocs.pm/pdf_extractor)
+[![Downloads](https://img.shields.io/hexpm/dt/pdf_extractor.svg)](https://hex.pm/packages/pdf_extractor)
+[![License](https://img.shields.io/hexpm/l/pdf_extractor.svg)](https://hex.pm/packages/pdf_extractor)
+[![Last Commit](https://img.shields.io/github/last-commit/nelsonmestevao/pdf_extractor.svg)](https://github.com/nelsonmestevao/pdf_extractor)
+
+
+A powerful and easy-to-use Elixir library for extracting text and metadata from PDF files.
+
+PdfExtractor calls Python's `pdfplumber` library through `pythonx` to extract text and metadata
+from PDFs. It accepts both file paths and in-memory binaries, so it works for local files
+as well as PDFs that never touch the filesystem, such as HTTP downloads.
+
+## Features
+
+- 🔍 Extract text from single or multiple PDF pages
+- 📍 Area-based extraction using bounding boxes
+- 🌐 Work with PDF data directly from memory (e.g., HTTP downloads)
+- 📊 Get PDF metadata like title, author, creation date
+- 🐍 Leverages Python's powerful `pdfplumber` library
+- 🚀 Simple and intuitive API
+- ✅ Comprehensive test coverage
+- 📚 Full documentation
+
+## Installation
+
+Add `pdf_extractor` to your list of dependencies in `mix.exs`:
+
+```elixir
+def deps do
+  [
+    {:pdf_extractor, "~> 0.5.0"}
+  ]
+end
+```
+
+Then start it in your application start function:
+
+```elixir
+defmodule MyApp.Application do
+  use Application
+
+  def start(_type, _args) do
+    children = [
+        PdfExtractor,
+        ...
+    ]
+
+    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
+    Supervisor.start_link(children, opts)
+  end
+end
+```
+
+## Usage
+
+Extract text from specific regions using bounding boxes `{x0, y0, x1, y1}`:
+
+```elixir
+areas = %{
+  0 => {0, 0, 300, 200},    # Top-left area of page 0
+  1 => [
+    {200, 300, 600, 500}, # Bottom-right area of page 1
+    {0, 0, 200, 250}, # Top-left area of page 1
+  ]
+}
+PdfExtractor.extract_text("path/to/document.pdf", areas)
+```
+
+### Return Format
+
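+On success, the function returns `{:ok, result}`, where `result` is a map from page number to extracted text: a string for a whole page or a single area, and a list of strings when the page was given a list of areas. Errors raised on the Python side are returned as `{:error, exception}`. For the `areas` map above:
+
+```elixir
+{:ok,
+ %{
+   0 => "Text from page 0...",
+   1 => ["Text from page 1 (first area)...", "Text from page 1 (second area)..."]
+ }}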
+```
+
+## License
+
+This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+
+## Acknowledgments
+
+- Built on top of the excellent [pdfplumber](https://github.com/jsvine/pdfplumber) Python library
+- Uses [pythonx](https://github.com/livebook-dev/pythonx) for seamless Python integration
--- a/file_processor/dep_pdf_extractor_mod/hex_metadata.config
+++ b/file_processor/dep_pdf_extractor_mod/hex_metadata.config
@@ -0,0 +1,22 @@
+{<<"links">>,
+ [{<<"Changelog">>,
+   <<"https://github.com/nelsonmestevao/pdf_extractor/blob/main/CHANGELOG.md">>},
+  {<<"GitHub">>,<<"https://github.com/nelsonmestevao/pdf_extractor">>}]}.
+{<<"name">>,<<"pdf_extractor">>}.
+{<<"version">>,<<"0.5.0">>}.
+{<<"description">>,
+ <<"A lightweight Elixir library for extracting text from PDF files using Python's pdfplumber.\nSupports single and multi-page extraction with optional area filtering.">>}.
+{<<"elixir">>,<<"~> 1.15">>}.
+{<<"app">>,<<"pdf_extractor">>}.
+{<<"files">>,
+ [<<"lib">>,<<"lib/pdf_extractor">>,<<"lib/pdf_extractor/pdf_plumber.ex">>,
+  <<"lib/pdf_extractor.ex">>,<<"mix.exs">>,<<"README.md">>,<<"LICENSE">>,
+  <<"CHANGELOG.md">>]}.
+{<<"licenses">>,[<<"MIT">>]}.
+{<<"requirements">>,
+ [[{<<"name">>,<<"pythonx">>},
+   {<<"app">>,<<"pythonx">>},
+   {<<"optional">>,false},
+   {<<"requirement">>,<<"~> 0.4.4">>},
+   {<<"repository">>,<<"hexpm">>}]]}.
+{<<"build_tools">>,[<<"mix">>]}.
--- a/file_processor/dep_pdf_extractor_mod/lib/pdf_extractor.ex
+++ b/file_processor/dep_pdf_extractor_mod/lib/pdf_extractor.ex
@@ -0,0 +1,228 @@
+defmodule PdfExtractor do
+  @moduledoc "README.md"
+             |> File.read!()
+             |> String.split("\n\n")
+             |> tl()
+             |> tl()
+             |> Enum.join("\n\n")
+  use GenServer
+
+  @external_resource "README.md"
+
+  # Client
+
+  def start_link(opts \\ []) do
+    opts = Keyword.validate!(opts, name: __MODULE__)
+    GenServer.start_link(__MODULE__, [], name: opts[:name])
+  end
+
+  @doc ~S"""
+  Extracts text from PDF pages.
+
+  It supports extracting from single pages, multiple pages, and specific areas within pages.
+
+  ## Page Numbers
+
+  - **Integer**: Extract from single page (e.g., `0` for first page)
+  - **List**: Extract from multiple pages (e.g., `[0, 1, 2]`)
+  - **Empty list** `[]`: Extract from all pages (default)
+
+  ## Areas Format
+
+  Areas are specified as a map where keys are page numbers and values are bounding boxes:
+
+  - **Single area**: `%{0 => {x0, y0, x1, y1}}`
+  - **Multiple areas**: `%{0 => [{x0, y0, x1, y1}, {x2, y2, x3, y3}]}`
+  - **Mixed**: `%{0 => {x0, y0, x1, y1}, 1 => [{x2, y2, x3, y3}, {x4, y4, x5, y5}]}`
+
+  ## Examples
+
+    Extract text from all pages.
+
+      iex> PdfExtractor.extract_text("priv/fixtures/fatura.pdf")
+      {:ok,
+       %{
+         0 =>
+           "Text Example Bill FATURA\n# 2025010002\nData: Jun 21, 2025\nProjeto de lei para:\nSaldo devedor: 1 525,59 €\nElixir Company\nItem Quantidade Avaliar Quantia\nTrabalho 1 1 500,00 € 1 500,00 €\nMais trabalho 1 25,59 € 25,59 €\nSubtotal: 1 525,59 €\nImposto (0%): 0,00 €\nTotal: 1 525,59 €",
+         1 =>
+           "✂\nReceipt Payment part Account / Payable to\nCH4431999123000889012\n✂\nMax Muster & Söhne\nAccount / Payable to\nCH4431999123000889012 Musterstrasse 123\nMax Muster & Söhne 8000 Seldwyla\nMusterstrasse 123\n8000 Seldwyla\nReference\n210000000003139471430009017\nReference\n210000000003139471430009017\nAdditional information\nBestellung vom 15.10.2020\nPayable by (name/address)\nSimon Muster\nPayable by (name/address)\nMusterstrasse 1\nCurrency Amount\nSimon Muster\n8000 Seldwyla\nCHF 1 949.75 Musterstrasse 1\n8000 Seldwyla\nCurrency Amount\nCHF 1 949.75\nAcceptance point"
+       }}
+
+    Extract text from only some pages.
+
+      iex> PdfExtractor.extract_text("priv/fixtures/fatura.pdf", [0])
+      {:ok,
+       %{
+         0 =>
+           "Text Example Bill FATURA\n# 2025010002\nData: Jun 21, 2025\nProjeto de lei para:\nSaldo devedor: 1 525,59 €\nElixir Company\nItem Quantidade Avaliar Quantia\nTrabalho 1 1 500,00 € 1 500,00 €\nMais trabalho 1 25,59 € 25,59 €\nSubtotal: 1 525,59 €\nImposto (0%): 0,00 €\nTotal: 1 525,59 €"
+       }}
+
+    Extract only the titles in the book chapters.
+
+      iex> PdfExtractor.extract_text("priv/fixtures/book.pdf", %{
+      ...>   2 => {0, 0, 612, 190},
+      ...>   8 => {0, 0, 612, 190},
+      ...>   10 => {0, 0, 612, 190}
+      ...> })
+      {:ok,
+       %{
+         2 => "Introdução – Nota do tradutor",
+         8 => "I. Sobre aproveitar o tempo",
+         10 => "II. Sobre a falta de foco na Leitura"
+       }}
+
+    Extract multiple areas from a single page.
+
+      iex> PdfExtractor.extract_text("priv/fixtures/book.pdf", %{
+      ...>   1 => [{0, 100, 612, 140}, {0, 400, 612, 440}]
+      ...> })
+      {:ok,
+       %{
+         1 => [
+           "CARTAS DE UM ESTOICO, Volume I",
+           "Montecristo Editora Ltda.\ne-mail: editora@montecristoeditora.com.br"
+         ]
+       }}
+  """
+  def extract_text(file_path, pages \\ []) do
+    GenServer.call(__MODULE__, {:extract_text, [file_path, pages]})
+  end
+
+  def extract_text_timeout(file_path, pages \\ [], timeout) do
+    GenServer.call(__MODULE__, {:extract_text, [file_path, pages]}, timeout)
+  end
+
+  @doc ~S"""
+  Extracts text from PDF binary data. See `extract_text/2` for details on how to specify pages and areas.
+
+  This function allows you to extract text from PDF data that's already in memory,
+  such as data downloaded from a URL or received via an API. This avoids the need
+  to write the PDF to the filesystem.
+
+  ## Examples
+
+    Extract text from all pages.
+
+      iex> content = File.read!("priv/fixtures/fatura.pdf")
+      ...> PdfExtractor.extract_text_from_binary(content)
+      {:ok,
+       %{
+         0 =>
+           "Text Example Bill FATURA\n# 2025010002\nData: Jun 21, 2025\nProjeto de lei para:\nSaldo devedor: 1 525,59 €\nElixir Company\nItem Quantidade Avaliar Quantia\nTrabalho 1 1 500,00 € 1 500,00 €\nMais trabalho 1 25,59 € 25,59 €\nSubtotal: 1 525,59 €\nImposto (0%): 0,00 €\nTotal: 1 525,59 €",
+         1 =>
+           "✂\nReceipt Payment part Account / Payable to\nCH4431999123000889012\n✂\nMax Muster & Söhne\nAccount / Payable to\nCH4431999123000889012 Musterstrasse 123\nMax Muster & Söhne 8000 Seldwyla\nMusterstrasse 123\n8000 Seldwyla\nReference\n210000000003139471430009017\nReference\n210000000003139471430009017\nAdditional information\nBestellung vom 15.10.2020\nPayable by (name/address)\nSimon Muster\nPayable by (name/address)\nMusterstrasse 1\nCurrency Amount\nSimon Muster\n8000 Seldwyla\nCHF 1 949.75 Musterstrasse 1\n8000 Seldwyla\nCurrency Amount\nCHF 1 949.75\nAcceptance point"
+       }}
+
+    Extract text from only some pages.
+
+      iex> content = File.read!("priv/fixtures/fatura.pdf")
+      ...> PdfExtractor.extract_text_from_binary(content, [0])
+      {:ok,
+       %{
+         0 =>
+           "Text Example Bill FATURA\n# 2025010002\nData: Jun 21, 2025\nProjeto de lei para:\nSaldo devedor: 1 525,59 €\nElixir Company\nItem Quantidade Avaliar Quantia\nTrabalho 1 1 500,00 € 1 500,00 €\nMais trabalho 1 25,59 € 25,59 €\nSubtotal: 1 525,59 €\nImposto (0%): 0,00 €\nTotal: 1 525,59 €"
+       }}
+
+    Extract only the titles in the book chapters.
+
+      iex> content = File.read!("priv/fixtures/book.pdf")
+      ...>
+      ...> PdfExtractor.extract_text_from_binary(content, %{
+      ...>   2 => {0, 0, 612, 190},
+      ...>   8 => {0, 0, 612, 190},
+      ...>   10 => {0, 0, 612, 190}
+      ...> })
+      {:ok,
+       %{
+         2 => "Introdução – Nota do tradutor",
+         8 => "I. Sobre aproveitar o tempo",
+         10 => "II. Sobre a falta de foco na Leitura"
+       }}
+
+    Extract multiple areas from a single page.
+
+      iex> content = File.read!("priv/fixtures/book.pdf")
+      ...>
+      ...> PdfExtractor.extract_text_from_binary(content, %{
+      ...>   1 => [{0, 100, 612, 140}, {0, 400, 612, 440}]
+      ...> })
+      {:ok,
+       %{
+         1 => [
+           "CARTAS DE UM ESTOICO, Volume I",
+           "Montecristo Editora Ltda.\ne-mail: editora@montecristoeditora.com.br"
+         ]
+       }}
+
+  """
+  def extract_text_from_binary(binary, pages \\ []) do
+    GenServer.call(__MODULE__, {:extract_text_from_binary, [binary, pages]})
+  end
+
+  @doc """
+  Extracts metadata from the info trailer of a PDF file. Typically includes "CreationDate", "ModDate", "Producer", et cetera.
+
+  ## Examples
+
+      iex> PdfExtractor.extract_metadata("priv/fixtures/book.pdf")
+      {:ok,
+       %{
+         "CreationDate" => "D:20250718212328Z",
+         "Creator" => "Stirling-PDF v0.44.2",
+         "ModDate" => "D:20250718212328Z",
+         "Producer" => "Stirling-PDF v0.44.2"
+       }}
+
+  """
+  def extract_metadata(file_path) do
+    GenServer.call(__MODULE__, {:extract_metadata, [file_path]})
+  end
+
+  @doc """
+  Extracts metadata from PDF binary data. Similar to `extract_metadata/1` but works with PDF data in memory instead of
+  files.
+
+  ## Examples
+
+      iex> content = File.read!("priv/fixtures/book.pdf")
+      ...> PdfExtractor.extract_metadata_from_binary(content)
+      {:ok,
+       %{
+         "CreationDate" => "D:20250718212328Z",
+         "Creator" => "Stirling-PDF v0.44.2",
+         "ModDate" => "D:20250718212328Z",
+         "Producer" => "Stirling-PDF v0.44.2"
+       }}
+
+  """
+  def extract_metadata_from_binary(binary) do
+    GenServer.call(__MODULE__, {:extract_metadata_from_binary, [binary]})
+  end
+
+  # Server
+
+  @doc false
+  @impl true
+  def init([] = state) do
+    try do
+      :ok = PdfExtractor.PdfPlumber.start()
+    rescue
+      e in RuntimeError ->
+        if e.message =~ ~r/Python interpreter has already been initialized/ do
+          :ok
+        else
+          reraise e, __STACKTRACE__
+        end
+    end
+
+    {:ok, state}
+  end
+
+  @doc false
+  @impl true
+  def handle_call({function, args}, _from, state) when is_atom(function) and is_list(args) do
+    {:reply, {:ok, apply(PdfExtractor.PdfPlumber, function, args)}, state}
+  rescue
+    exception in Pythonx.Error -> {:reply, {:error, exception}, state}
+  end
+end
--- a/file_processor/dep_pdf_extractor_mod/lib/pdf_extractor/pdf_plumber.ex
+++ b/file_processor/dep_pdf_extractor_mod/lib/pdf_extractor/pdf_plumber.ex
@@ -0,0 +1,189 @@
+defmodule PdfExtractor.PdfPlumber do
+  @moduledoc false
+
+  def start do
+    Pythonx.uv_init("""
+    [project]
+    name = "pdf_extractor"
+    version = "#{to_string(version())}"
+    requires-python = "==3.12.*"
+    dependencies = [
+      "pdfplumber==0.11.7"
+    ]
+    """)
+  end
+
+  @type area :: {non_neg_integer(), non_neg_integer(), non_neg_integer(), non_neg_integer()}
+  @type page :: non_neg_integer()
+
+  @spec extract_text(
+          file_path :: String.t(),
+          pages :: page() | list(page()) | %{page() => area() | [area()] | nil}
+        ) :: %{page() => String.t() | list(String.t())}
+  def extract_text(file_path, page_number) when is_integer(page_number) do
+    extract_text(file_path, List.wrap(page_number))
+  end
+
+  def extract_text(file_path, pages) when is_list(pages) do
+    """
+    #{python_extract_code()}
+
+    main(file_path.decode('utf-8'), page_numbers, areas)
+    """
+    |> Pythonx.eval(%{
+      "file_path" => file_path,
+      "page_numbers" => pages,
+      "areas" => %{}
+    })
+    |> elem(0)
+    |> Pythonx.decode()
+    |> to_map(pages)
+  end
+
+  def extract_text(file_path, pages) when is_map(pages) do
+    """
+    #{python_extract_code()}
+
+    main(file_path.decode('utf-8'), page_numbers, areas)
+    """
+    |> Pythonx.eval(%{
+      "file_path" => file_path,
+      "page_numbers" => Map.keys(pages),
+      "areas" => pages
+    })
+    |> elem(0)
+    |> Pythonx.decode()
+    |> to_map(Map.keys(pages))
+  end
+
+  @doc """
+  Extracts text from in-memory PDF data without writing it to the filesystem, e.g. a PDF fetched over HTTP:
+
+      url = "https://erlang.org/download/armstrong_thesis_2003.pdf"
+      url |> :httpc.request() |> elem(1) |> elem(2) |> :binary.list_to_bin() |> PdfExtractor.extract_text_from_binary()
+  """
+  def extract_text_from_binary(binary, page_number) when is_integer(page_number) do
+    extract_text_from_binary(binary, List.wrap(page_number))
+  end
+
+  def extract_text_from_binary(binary, pages) when is_list(pages) do
+    """
+    from io import BytesIO
+
+    #{python_extract_code()}
+
+    main(BytesIO(binary), page_numbers, areas)
+    """
+    |> Pythonx.eval(%{
+      "binary" => binary,
+      "page_numbers" => pages,
+      "areas" => %{}
+    })
+    |> elem(0)
+    |> Pythonx.decode()
+    |> to_map(pages)
+  end
+
+  def extract_text_from_binary(binary, pages) when is_map(pages) do
+    """
+    from io import BytesIO
+
+    #{python_extract_code()}
+
+    main(BytesIO(binary), page_numbers, areas)
+    """
+    |> Pythonx.eval(%{
+      "binary" => binary,
+      "page_numbers" => Map.keys(pages),
+      "areas" => pages
+    })
+    |> elem(0)
+    |> Pythonx.decode()
+    |> to_map(Map.keys(pages))
+  end
+
+  defp python_extract_code do
+    """
+    import pdfplumber
+    import logging
+
+    logging.getLogger("pdfminer").setLevel(logging.ERROR)
+
+    def extract_from_page(page, areas=None):
+        if areas is None:
+            return page.extract_text()
+        elif isinstance(areas, list):
+            return [page.within_bbox(area).extract_text() for area in areas]
+        else:
+            return page.within_bbox(areas).extract_text()
+
+    def main(content, page_numbers, areas):
+        results = []
+        with pdfplumber.open(content) as pdf:
+            total_pages = len(pdf.pages)
+            if page_numbers == []:
+                page_numbers = list(range(total_pages))
+            for page_number in page_numbers:
+                if page_number >= 0 and page_number < total_pages:
+                    results.append(extract_from_page(pdf.pages[page_number], areas.get(page_number)))
+            return results
+    """
+  end
+
+  def extract_metadata(file_path) do
+    """
+    #{python_extract_metadata_code()}
+
+    main(file_path.decode('utf-8'))
+    """
+    |> Pythonx.eval(%{
+      "file_path" => file_path
+    })
+    |> elem(0)
+    |> Pythonx.decode()
+  end
+
+  def extract_metadata_from_binary(binary) do
+    """
+    from io import BytesIO
+
+    #{python_extract_metadata_code()}
+
+    main(BytesIO(binary))
+    """
+    |> Pythonx.eval(%{
+      "binary" => binary
+    })
+    |> elem(0)
+    |> Pythonx.decode()
+  end
+
+  defp python_extract_metadata_code do
+    """
+    import pdfplumber
+    import logging
+
+    logging.getLogger("pdfminer").setLevel(logging.ERROR)
+
+    def main(content):
+        with pdfplumber.open(content) as pdf:
+            return pdf.metadata
+    """
+  end
+
+  defp to_map(texts, []) when is_list(texts) do
+    texts
+    |> Enum.with_index(&{&2, &1})
+    |> Map.new()
+  end
+
+  defp to_map(texts, page_numbers) when is_list(texts) do
+    page_numbers
+    |> Enum.zip(texts)
+    |> Map.new()
+  end
+
+  defp version do
+    Application.spec(:pdf_extractor, :vsn)
+  end
+end
--- a/file_processor/dep_pdf_extractor_mod/mix.exs
+++ b/file_processor/dep_pdf_extractor_mod/mix.exs
@@ -0,0 +1,90 @@
+defmodule PdfExtractor.MixProject do
+  use Mix.Project
+
+  @app :pdf_extractor
+  @name "PdfExtractor"
+  @version "0.5.0"
+  @source_url "https://github.com/nelsonmestevao/pdf_extractor"
+
+  def project do
+    [
+      name: @name,
+      app: @app,
+      version: @version,
+      elixir: "~> 1.15",
+      start_permanent: Mix.env() == :prod,
+      deps: deps(),
+      description: description(),
+      package: package(),
+      docs: docs(),
+      aliases: aliases(),
+      dialyzer: dialyzer(),
+      source_url: @source_url
+    ]
+  end
+
+  def application do
+    [
+      extra_applications: [:logger]
+    ]
+  end
+
+  defp deps do
+    [
+      {:pythonx, "~> 0.4.4"},
+
+      # tools
+      {:credo, "~> 1.7", only: [:dev, :test], runtime: false},
+      {:dialyxir, "~> 1.4", only: [:dev, :test], runtime: false},
+      {:doctest_formatter, "~> 0.4.0", only: [:dev, :test], runtime: false},
+      {:ex_doc, "~> 0.38", only: :dev, runtime: false},
+      {:styler, "~> 1.0", only: [:dev, :test], runtime: false}
+    ]
+  end
+
+  defp aliases do
+    [
+      "lint.dialyzer": ["dialyzer --format dialyxir"]
+    ]
+  end
+
+  defp description do
+    """
+    A lightweight Elixir library for extracting text from PDF files using Python's pdfplumber.
+    Supports single and multi-page extraction with optional area filtering.
+    """
+  end
+
+  defp package do
+    [
+      name: @app,
+      files: ~w(lib mix.exs README.md LICENSE CHANGELOG*),
+      licenses: ["MIT"],
+      links: %{
+        "GitHub" => @source_url,
+        "Changelog" => "#{@source_url}/blob/main/CHANGELOG.md"
+      },
+      maintainers: ["Nelson Estevão <nelsonmestevao@proton.me>"]
+    ]
+  end
+
+  defp docs do
+    [
+      main: "readme",
+      name: @name,
+      source_ref: "v#{@version}",
+      source_url: @source_url,
+      extras: ["README.md", "CHANGELOG.md", "LICENSE"]
+    ]
+  end
+
+  defp dialyzer do
+    [
+      flags: [:no_opaque],
+      list_unused_filters: true,
+      plt_add_deps: :apps_tree,
+      plt_add_apps: [:ex_unit, :iex, :mix, :credo_naming],
+      plt_file: {:no_warn, "priv/plts/elixir-#{System.version()}-erlang-otp-#{System.otp_release()}.plt"}
+    ]
+  end
+end