
Discovering Full-Text PDFs for Biomedical Literature: An Architectural Essay

The challenge of programmatically obtaining full-text PDFs of biomedical literature is deceptively complex. While it might seem straightforward—find a URL, download a file—the reality involves navigating a fragmented landscape of repositories, access controls, anti-bot protections, and constantly evolving publisher policies.

BMLibrarian's PDF discovery system addresses this challenge through a carefully designed architecture that prioritizes reliability, extensibility, and respect for legal access channels.

The Problem Space

Academic publishing exists in a state of productive tension. Publishers maintain paywalls to fund peer review and editorial processes. Open access initiatives push for unrestricted knowledge sharing. Institutional subscriptions create archipelagos of access rights. The result is that obtaining a PDF of a given paper may require navigating any of several pathways:

  1. Open Access Repositories: PubMed Central (PMC) hosts millions of freely accessible biomedical papers, either through open access mandates or author submissions.

  2. Open Access Aggregators: Services like Unpaywall index open access versions scattered across institutional repositories, preprint servers, and publisher websites.

  3. Publisher Websites: Direct access through DOI resolution, sometimes freely available, sometimes paywalled.

  4. Institutional Proxies: OpenAthens, Shibboleth, and similar systems authenticate users against institutional subscriptions.

Each pathway has its own API conventions, rate limits, response formats, and failure modes. A robust discovery system must orchestrate these diverse sources while handling the inevitable failures gracefully.

Architectural Philosophy: The Resolver Pattern

The core architectural insight driving our system is that each PDF source should be encapsulated as an independent "resolver"—a self-contained unit that knows how to query a specific source and interpret its responses. This pattern provides several benefits:

Isolation of Concerns: Each resolver manages its own API interactions, error handling, and response parsing. When PubMed Central changes their API format, only the PMC resolver needs updating.

Graceful Degradation: If one resolver fails, whether due to network issues, API changes, or rate limiting, the system continues with the remaining resolvers, so no single source becomes a point of failure.

Extensibility: Adding a new source requires implementing a single class with a well-defined interface. The orchestrator automatically incorporates new resolvers.

Testability: Each resolver can be unit tested in isolation with mocked responses, enabling confident refactoring.

The abstract base class defines the contract:

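from abc import ABC, abstractmethod

# DocumentIdentifiers and ResolutionResult are the system's own data classes,
# defined elsewhere; they are quoted below so this snippet stands alone.

class BaseResolver(ABC):
    """Contract that every PDF source resolver implements."""

    @property
    @abstractmethod
    def name(self) -> str:
        """Resolver name for logging and identification."""

    @abstractmethod
    def resolve(self, identifiers: "DocumentIdentifiers") -> "ResolutionResult":
        """Resolve document identifiers to PDF sources."""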

Every resolver returns a ResolutionResult that encapsulates not just the found sources, but metadata about the resolution attempt itself—timing information, error messages, and status codes. This instrumentation proves invaluable for debugging and optimization.
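
As an illustration of that instrumentation, here is a minimal sketch of a result object and a wrapper that times a resolver and turns exceptions into error results. The field names (`status`, `sources`, `duration_ms`, `error`), the `PDFSource` record, and the `run_resolver` helper are assumptions made for this example, not BMLibrarian's actual definitions.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PDFSource:
    """Hypothetical record for one candidate PDF location."""
    url: str
    priority: int = 100


@dataclass
class ResolutionResult:
    """Hypothetical result: found sources plus metadata about the attempt."""
    resolver_name: str
    status: str = "not_found"  # "success", "not_found", or "error"
    sources: List[PDFSource] = field(default_factory=list)
    duration_ms: float = 0.0
    error: Optional[str] = None


def run_resolver(name: str, resolve: Callable[[], List[PDFSource]]) -> ResolutionResult:
    """Run one resolver callable, timing it and converting exceptions to results."""
    start = time.perf_counter()
    result = ResolutionResult(resolver_name=name)
    try:
        result.sources = resolve()
        result.status = "success" if result.sources else "not_found"
    except Exception as exc:  # one failing resolver must not stop the others
        result.status = "error"
        result.error = f"{type(exc).__name__}: {exc}"
    result.duration_ms = (time.perf_counter() - start) * 1000.0
    return result
```

Because failures come back as ordinary results rather than exceptions, an orchestrator can call every resolver in turn and still report which ones failed and how long each took.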

The Multi-Source Approach

Our system implements five primary resolvers, each targeting a different segment of the academic publishing landscape:

PMCResolver: The Gold Standard for Open Access

PubMed Central represents the most reliable source for biomedical PDFs. As a government-mandated repository, it maintains stable APIs and guarantees open access. The PMC resolver:

  • Queries the PMC OA web service API
  • Converts PMIDs to PMCIDs when necessary
  • Falls back to constructed URLs when the API is unavailable
  • Assigns its sources a high priority (values 5-6; lower values are tried first)
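
The constructed-URL fallback could look roughly like the sketch below. The URL template is an assumption based on the commonly seen PMC article path and should be checked against current PMC documentation; the helper names are hypothetical.

```python
import re
from typing import Optional

# Assumed URL pattern for the fallback; not confirmed by the source text.
PMC_PDF_URL_TEMPLATE = "https://www.ncbi.nlm.nih.gov/pmc/articles/{pmcid}/pdf/"

# Accepts "PMC12345", "pmc12345", or bare digits.
_PMCID_RE = re.compile(r"^(?:PMC)?(\d+)$", re.IGNORECASE)


def normalize_pmcid(raw: str) -> Optional[str]:
    """Return the identifier as 'PMC<digits>', or None if it is malformed."""
    match = _PMCID_RE.match(raw.strip())
    return f"PMC{match.group(1)}" if match else None


def build_pmc_pdf_url(raw_pmcid: str) -> Optional[str]:
    """Build a candidate PDF URL without contacting any API."""
    pmcid = normalize_pmcid(raw_pmcid)
    if pmcid is None:
        return None
    return PMC_PDF_URL_TEMPLATE.format(pmcid=pmcid)
```

A URL built this way is only a candidate: it still has to pass the download-time validation before it counts as a PDF.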

UnpaywallResolver: Aggregating Open Access

Unpaywall maintains an index of legal open access versions across the web. This is particularly valuable for finding:

  • Author-deposited versions in institutional repositories
  • Green open access copies on personal websites
  • Publisher-provided open access versions

The resolver extracts rich metadata from Unpaywall responses, including license information, version type (published, accepted, submitted), and Unpaywall's own "best OA location" recommendation.
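
A sketch of that extraction step, working on an already-decoded JSON record. The keys `best_oa_location`, `oa_locations`, `url_for_pdf`, `license`, and `version` follow the Unpaywall v2 response format; the function name and the shape of the returned dictionaries are assumptions for illustration.

```python
from typing import Any, Dict, List


def extract_pdf_locations(record: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect PDF locations from an Unpaywall-style record, best location first."""
    best = record.get("best_oa_location") or {}
    best_url = best.get("url_for_pdf")
    found: List[Dict[str, Any]] = []
    seen = set()
    # The recommended location goes first, followed by the remaining ones.
    for loc in [best] + list(record.get("oa_locations") or []):
        url = loc.get("url_for_pdf")
        if not url or url in seen:
            continue  # skip entries without a PDF link and duplicates
        seen.add(url)
        found.append({
            "url": url,
            "license": loc.get("license"),
            "version": loc.get("version"),
            "is_best": url == best_url,
        })
    return found
```

Entries whose `url_for_pdf` is null point only to a landing page, so this sketch drops them rather than passing them on as PDF candidates.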

DOIResolver: Direct Publisher Access

When open access sources aren't available, DOI resolution provides a pathway to publisher websites. The resolver employs two strategies:

  1. CrossRef API: Queries structured metadata that may include PDF links
  2. Content Negotiation: Requests application/pdf directly from doi.org
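
Both strategies can be sketched without network access. The first function filters the `link` entries of a decoded CrossRef work record; the second builds, but does not send, a doi.org request with an `Accept: application/pdf` header. The function names are hypothetical, and whether a given publisher honors the header is not something this sketch can promise.

```python
import urllib.request
from typing import Any, Dict, List
from urllib.parse import quote


def crossref_pdf_links(work: Dict[str, Any]) -> List[str]:
    """Return URLs from a CrossRef work record's `link` list that declare a PDF."""
    links = work.get("message", {}).get("link", [])
    return [
        link["URL"]
        for link in links
        if link.get("content-type") == "application/pdf" and "URL" in link
    ]


def build_negotiation_request(doi: str) -> urllib.request.Request:
    """Build (but do not send) a doi.org request asking for a PDF representation."""
    url = "https://doi.org/" + quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": "application/pdf"})
```

Keeping request construction separate from sending makes the header logic easy to unit test with mocked responses.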

DirectURLResolver: Database Knowledge

When our database already contains a PDF URL, this simple resolver validates the URL format and passes it through. No external requests needed.
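
A format-only check of this kind might look like the following sketch. The function name and the exact acceptance rule (http or https scheme plus a non-empty host) are assumptions; the real resolver may apply stricter or looser criteria.

```python
from urllib.parse import urlparse


def is_plausible_pdf_url(url: str) -> bool:
    """Check URL shape only: http(s) scheme and a non-empty host. No network access."""
    if not isinstance(url, str) or not url.strip():
        return False
    parsed = urlparse(url.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

Passing this check says nothing about whether the URL still serves a PDF; that is established later, at download time.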

OpenAthensResolver: Institutional Gateway

For paywalled content, institutional access via OpenAthens provides a final option. This resolver:

  • Constructs proxy URLs through the institution's authentication system
  • Integrates with the OpenAthens authentication module
  • Receives the least-preferred priority value (50), so free sources are always tried first
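
Proxy URL construction can be sketched as wrapping the publisher URL in a redirector link. The redirector prefix below and the `url` query parameter are assumptions based on the common OpenAthens redirector form; an institution's actual configuration may differ, so the prefix is left overridable.

```python
from urllib.parse import quote

# Assumed redirector prefix; each institution supplies its own configuration.
DEFAULT_REDIRECTOR = "https://go.openathens.net/redirector/"


def build_openathens_url(target_url: str, institution_domain: str,
                         redirector: str = DEFAULT_REDIRECTOR) -> str:
    """Wrap a publisher URL so the request passes through institutional login."""
    # Percent-encode everything so the target's own query string survives intact.
    encoded = quote(target_url, safe="")
    return f"{redirector}{institution_domain}?url={encoded}"
```

The wrapped URL only routes the request through the institution's login; access still depends on the user authenticating and on the institution holding a subscription.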

The Two-Phase Download Strategy

Discovery alone doesn't guarantee successful downloads. Web servers employ various protection mechanisms, and PDF delivery varies widely across publishers. Our download strategy addresses this through a two-phase approach:

Phase 1: Direct HTTP Download

For most sources, simple HTTP requests suffice. The system:

  1. Iterates through discovered sources in priority order
  2. Issues GET requests with streaming enabled
  3. Implements exponential backoff retry (2s, 4s, 8s delays)
  4. Validates content type (rejecting HTML login pages)
  5. Verifies PDF magic bytes (%PDF- at file start)
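
The retry and validation steps above can be sketched as follows. To keep the example self-contained, the network call is passed in as a `fetch` function; the names `Fetcher`, `looks_like_pdf`, and `download_with_retry` are hypothetical, and the choice to retry only on network errors (not on a non-PDF response) is an assumption of this sketch.

```python
import time
from typing import Callable, Optional, Tuple

PDF_MAGIC = b"%PDF-"
RETRY_DELAYS = (2, 4, 8)  # seconds, doubling each time

# A fetcher returns (content_type, body) or raises OSError on network failure.
Fetcher = Callable[[str], Tuple[str, bytes]]


def looks_like_pdf(content_type: str, body: bytes) -> bool:
    """Reject HTML responses and require the PDF magic bytes at the start."""
    if "html" in content_type.lower():
        return False
    return body.startswith(PDF_MAGIC)


def download_with_retry(url: str, fetch: Fetcher,
                        sleep: Callable[[float], None] = time.sleep) -> Optional[bytes]:
    """Try once, then retry after each delay; return PDF bytes or None."""
    for delay in (0,) + RETRY_DELAYS:
        if delay:
            sleep(delay)
        try:
            content_type, body = fetch(url)
        except OSError:
            continue  # transient network error: back off and retry
        # A non-PDF response (e.g. a login page) will not improve on retry.
        return body if looks_like_pdf(content_type, body) else None
    return None
```

In a real downloader the fetcher would wrap a streaming GET request and write to disk in chunks; injecting it here lets the backoff schedule and the validation rules be tested without touching the network.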

Phase 2: Browser Fallback

When HTTP downloads fail—typically due to Cloudflare protection, JavaScript-required pages, or embedded PDF viewers—the system falls back to browser automation via Playwright. This phase:

  • Launches a headless Chromium instance with anti-detection measures
  • Handles Cloudflare's "checking your browser" interstitials
  • Extracts PDFs from embedded viewers
  • Maintains realistic browser fingerprints

The browser fallback is intentionally positioned as a last resort. It's slower, more resource-intensive, and more fragile. But for the subset of sources that require it, browser automation makes the difference between success and failure.

Type Safety and Data Modeling

The discovery system employs rigorous type safety through dataclasses and enums:

from enum import Enum

class SourceType(Enum):
    """Where a candidate PDF location was discovered."""
    DIRECT_URL = "direct_url"
    DOI_REDIRECT = "doi_redirect"
    PMC = "pmc"
    UNPAYWALL = "unpaywall"
    OPENATHENS = "openathens"
    BROWSER = "browser"
    UNKNOWN = "unknown"

class AccessType(Enum):
    """What kind of access the source requires."""
    OPEN = "open"
    INSTITUTIONAL = "institutional"
    SUBSCRIPTION = "subscription"
    UNKNOWN = "unknown"

These enums ensure that source types and access requirements are always valid values. A typo in a string literal might propagate silently; constructing an enum from an unrecognized value raises ValueError immediately, and a misspelled member name raises AttributeError.
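
The difference from plain strings shows up when stored values are converted back into enum members. The sketch below repeats `AccessType` so it runs on its own; both helper functions are hypothetical and only illustrate the strict and lenient ways to handle an unrecognized string.

```python
from enum import Enum


class AccessType(Enum):
    OPEN = "open"
    INSTITUTIONAL = "institutional"
    SUBSCRIPTION = "subscription"
    UNKNOWN = "unknown"


def parse_access_type(raw: str) -> AccessType:
    """Convert a stored string to AccessType; unrecognized strings raise ValueError."""
    return AccessType(raw)


def parse_access_type_lenient(raw: str) -> AccessType:
    """Hypothetical variant that maps unrecognized strings to UNKNOWN instead."""
    try:
        return AccessType(raw)
    except ValueError:
        return AccessType.UNKNOWN
```

The strict form suits internal code paths where a bad value signals a bug; the lenient form suits data read from external APIs, where new or misspelled values should not abort a resolution attempt.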

Priority-Based Source Selection

Not all PDF sources are created equal. A PMC PDF is essentially guaranteed to work; a publisher link might lead to a paywall. Our priority system encodes this knowledge:

Priority   Source Type        Rationale
1-3        Unpaywall "best"   Curated selection, proven accessible
5-6        PMC                Government repository, always free
10         Direct URL         Known URL, but accessibility varies
15-20      DOI/CrossRef       May be paywalled
50         OpenAthens         Requires authentication, last resort

Lower priority values indicate more desirable sources. The system sorts discovered sources by priority and attempts downloads in order, stopping at the first success.
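
The sort-then-try loop can be sketched in a few lines. `Candidate` and `first_successful` are hypothetical names, and the download attempt is passed in as a callable so the ordering logic stands alone.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for a discovered PDF source."""
    url: str
    priority: int


def first_successful(candidates: Iterable[Candidate],
                     try_download: Callable[[Candidate], bool]) -> Optional[Candidate]:
    """Attempt candidates from lowest to highest priority value; stop at first success."""
    for candidate in sorted(candidates, key=lambda c: c.priority):
        if try_download(candidate):
            return candidate
    return None
```

Python's `sorted` is stable, so candidates that share a priority value are attempted in the order they were discovered.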

Ethical Considerations

Our system is designed with ethical access in mind:

Legal Channels Only: We query legitimate APIs and respect access controls. The system never circumvents paywalls or authentication requirements.

Rate Limiting: Exponential backoff and reasonable timeouts prevent overwhelming servers.

Terms of Service: Unpaywall requires a contact email; we prominently document this requirement.

Open Access Priority: By prioritizing free sources, we minimize load on subscription services and support the open access ecosystem.

Conclusion

The full-text PDF discovery system in BMLibrarian represents a pragmatic response to the complexity of academic publishing infrastructure. By encapsulating diverse sources as independent resolvers, implementing robust fallback strategies, and maintaining rigorous type safety, we've created a system that reliably delivers PDFs while remaining adaptable to the evolving publishing landscape.

For biomedical researchers, this means more time analyzing literature and less time wrestling with download failures. For developers extending the system, it means clear patterns and well-defined interfaces. And for the broader ecosystem, it means supporting legal access channels and open access initiatives.

The fragments are united; the PDFs flow.


For technical implementation details, see the Developer Manual. For usage instructions, see the User Guide.