Discovering Full-Text PDFs for Biomedical Literature: An Architectural Essay

The challenge of programmatically obtaining full-text PDFs of biomedical literature is deceptively complex. While it might seem straightforward—find a URL, download a file—the reality involves navigating a fragmented landscape of repositories, access controls, anti-bot protections, and constantly evolving publisher policies.

BMLibrarian's PDF discovery system addresses this challenge through a carefully designed architecture that prioritizes reliability, extensibility, and respect for legal access channels.

The Problem Space

Academic publishing exists in a state of productive tension. Publishers maintain paywalls to fund peer review and editorial processes. Open access initiatives push for unrestricted knowledge sharing. Institutional subscriptions create archipelagos of access rights. The result is that obtaining a PDF of a given paper may require navigating any of several pathways:

Open Access Repositories: PubMed Central (PMC) hosts millions of freely accessible biomedical papers, either through open access mandates or author submissions.
Open Access Aggregators: Services like Unpaywall index open access versions scattered across institutional repositories, preprint servers, and publisher websites.
Publisher Websites: Direct access through DOI resolution, sometimes freely available, sometimes paywalled.
Institutional Proxies: OpenAthens, Shibboleth, and similar systems authenticate users against institutional subscriptions.

Each pathway has its own API conventions, rate limits, response formats, and failure modes. A robust discovery system must orchestrate these diverse sources while handling the inevitable failures gracefully.

Architectural Philosophy: The Resolver Pattern

The core architectural insight driving our system is that each PDF source should be encapsulated as an independent "resolver"—a self-contained unit that knows how to query a specific source and interpret its responses. This pattern provides several benefits:

Isolation of Concerns: Each resolver manages its own API interactions, error handling, and response parsing. When PubMed Central changes its API format, only the PMC resolver needs updating.

Graceful Degradation: If one resolver fails, whether from network issues, API changes, or rate limiting, the system continues with the remaining resolvers, so no single source becomes a point of failure.

Extensibility: Adding a new source requires implementing a single class with a well-defined interface. The orchestrator automatically incorporates new resolvers.

Testability: Each resolver can be unit tested in isolation with mocked responses, enabling confident refactoring.
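
The graceful-degradation behavior described above can be sketched in a few lines. This is a minimal illustration, not BMLibrarian's orchestrator: resolvers are modeled as plain callables, and the names `resolve_all`, `broken_resolver`, and `working_resolver` are hypothetical.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger("pdf_discovery")

# Hypothetical simplification: a resolver is any callable that maps
# identifiers to a list of candidate PDF URLs.
Resolver = Callable[[Dict[str, str]], List[str]]


def resolve_all(identifiers: Dict[str, str], resolvers: Dict[str, Resolver]) -> List[str]:
    """Collect candidate URLs, skipping any resolver that raises."""
    found: List[str] = []
    for name, resolver in resolvers.items():
        try:
            found.extend(resolver(identifiers))
        except Exception as exc:  # one failing source must not abort the others
            logger.warning("resolver %s failed: %s", name, exc)
    return found


def broken_resolver(identifiers: Dict[str, str]) -> List[str]:
    raise TimeoutError("simulated network failure")


def working_resolver(identifiers: Dict[str, str]) -> List[str]:
    return [f"https://example.org/{identifiers['doi']}.pdf"]


urls = resolve_all(
    {"doi": "10.1234/example"},
    {"broken": broken_resolver, "working": working_resolver},
)
```

The failing resolver is logged and skipped, while the result from the working one is still returned.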

The abstract base class defines the contract:

from __future__ import annotations  # annotations are not evaluated at definition time

from abc import ABC, abstractmethod


class BaseResolver(ABC):
    @property
    @abstractmethod
    def name(self) -> str:
        """Resolver name for logging and identification."""

    @abstractmethod
    def resolve(self, identifiers: DocumentIdentifiers) -> ResolutionResult:
        """Resolve document identifiers to PDF sources."""

Every resolver returns a ResolutionResult that encapsulates not just the found sources, but metadata about the resolution attempt itself—timing information, error messages, and status codes. This instrumentation proves invaluable for debugging and optimization.
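
To make the contract concrete, here is a hedged sketch of what a result object and a trivial resolver might look like. The field names (`resolver_name`, `sources`, `error`, `duration_ms`) and the `StoredURLResolver` class are assumptions chosen for illustration; the actual definitions in BMLibrarian may differ.

```python
import time
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DocumentIdentifiers:  # hypothetical fields
    doi: Optional[str] = None
    pmid: Optional[str] = None
    pdf_url: Optional[str] = None


@dataclass
class ResolutionResult:  # hypothetical fields
    resolver_name: str
    sources: List[str] = field(default_factory=list)
    error: Optional[str] = None
    duration_ms: float = 0.0


class BaseResolver(ABC):
    @property
    @abstractmethod
    def name(self) -> str: ...

    @abstractmethod
    def resolve(self, identifiers: DocumentIdentifiers) -> ResolutionResult: ...


class StoredURLResolver(BaseResolver):
    """Toy resolver: returns the URL already stored with the document."""

    @property
    def name(self) -> str:
        return "stored_url"

    def resolve(self, identifiers: DocumentIdentifiers) -> ResolutionResult:
        start = time.perf_counter()
        result = ResolutionResult(resolver_name=self.name)
        if identifiers.pdf_url:
            result.sources.append(identifiers.pdf_url)
        else:
            result.error = "no stored URL"
        # Record how long the attempt took, whether or not it succeeded.
        result.duration_ms = (time.perf_counter() - start) * 1000
        return result


hit = StoredURLResolver().resolve(DocumentIdentifiers(pdf_url="https://example.org/a.pdf"))
miss = StoredURLResolver().resolve(DocumentIdentifiers(doi="10.1234/example"))
```

Both outcomes carry the resolver name and timing, so a failed attempt is as inspectable as a successful one.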

The Multi-Source Approach

Our system implements five primary resolvers, each targeting a different segment of the academic publishing landscape:

PMCResolver: The Gold Standard for Open Access

PubMed Central is the most reliable source for biomedical PDFs. It is operated by the U.S. National Library of Medicine, maintains stable APIs, and makes the articles in its Open Access subset freely downloadable. The PMC resolver:

Queries the PMC OA web service API
Converts PMIDs to PMCIDs when necessary
Falls back to constructed URLs when the API is unavailable
Assigns high priority (5-6) to its sources
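
The "constructed URL" fallback can be illustrated with a small helper. This is a sketch under assumptions: the URL template below reflects the commonly seen PMC article path and should be checked against current PMC documentation, and the function names are hypothetical.

```python
import re

# Assumed URL pattern for a PMC article PDF; not taken from the source document.
PMC_PDF_URL = "https://pmc.ncbi.nlm.nih.gov/articles/{pmcid}/pdf/"


def normalize_pmcid(raw: str) -> str:
    """Return an identifier of the form 'PMC1234567' or raise ValueError."""
    match = re.fullmatch(r"(?i)\s*(?:PMC)?(\d+)\s*", raw)
    if match is None:
        raise ValueError(f"not a PMCID: {raw!r}")
    return f"PMC{match.group(1)}"


def build_pmc_pdf_url(raw_pmcid: str) -> str:
    """Build a candidate PDF URL without calling any API."""
    return PMC_PDF_URL.format(pmcid=normalize_pmcid(raw_pmcid))


candidate = build_pmc_pdf_url("pmc3531190")
```

Because the URL is constructed rather than confirmed by the API, a real resolver would presumably still validate the download before trusting it.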

UnpaywallResolver: Aggregating Open Access

Unpaywall maintains an index of legal open access versions across the web. This is particularly valuable for finding:

Author-deposited versions in institutional repositories
Green open access copies on personal websites
Publisher-provided open access versions

The resolver extracts rich metadata from Unpaywall responses, including license information, version type (published, accepted, submitted), and Unpaywall's own "best OA location" recommendation.
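
As an illustration of that metadata extraction, the sketch below parses an already-fetched Unpaywall record. The field names (`best_oa_location`, `oa_locations`, `url_for_pdf`, `license`, `version`, `host_type`) follow the Unpaywall v2 response format as I understand it; the function name and the sample record are hypothetical.

```python
from typing import Any, Dict, List


def extract_pdf_locations(record: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return PDF locations from an Unpaywall record, best location first."""
    best = record.get("best_oa_location") or {}
    best_url = best.get("url_for_pdf")
    results: List[Dict[str, Any]] = []
    seen = set()
    for location in record.get("oa_locations") or []:
        url = location.get("url_for_pdf")
        if not url or url in seen:  # skip landing-page-only and duplicate entries
            continue
        seen.add(url)
        results.append({
            "url": url,
            "license": location.get("license"),
            "version": location.get("version"),
            "host_type": location.get("host_type"),
            "is_best": url == best_url,
        })
    # Stable sort: the "best" location moves to the front, the rest keep their order.
    results.sort(key=lambda item: not item["is_best"])
    return results


sample_record = {
    "doi": "10.1234/example",
    "is_oa": True,
    "best_oa_location": {"url_for_pdf": "https://publisher.example/a.pdf"},
    "oa_locations": [
        {"url_for_pdf": "https://repo.example/a.pdf", "license": None,
         "version": "acceptedVersion", "host_type": "repository"},
        {"url_for_pdf": "https://publisher.example/a.pdf", "license": "cc-by",
         "version": "publishedVersion", "host_type": "publisher"},
        {"url_for_pdf": None, "host_type": "repository"},
    ],
}
locations = extract_pdf_locations(sample_record)
```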

DOIResolver: Direct Publisher Access

When open access sources aren't available, DOI resolution provides a pathway to publisher websites. The resolver employs two strategies:

CrossRef API: Queries structured metadata that may include PDF links
Content Negotiation: Requests application/pdf directly from doi.org
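
Both strategies can be sketched without touching the network. The first function filters the `link` entries of a Crossref works response for PDF content types; the second builds (but does not send) a content-negotiation request. The function names and the sample payload are hypothetical, and the response shape is my understanding of the Crossref REST API rather than something stated in this essay.

```python
from typing import Any, Dict, List
from urllib.parse import quote
from urllib.request import Request


def pdf_links_from_crossref(work: Dict[str, Any]) -> List[str]:
    """Pick out links that Crossref metadata labels as PDFs."""
    links = (work.get("message") or {}).get("link") or []
    return [
        link["URL"]
        for link in links
        if link.get("content-type") == "application/pdf" and link.get("URL")
    ]


def build_negotiation_request(doi: str) -> Request:
    """Prepare a doi.org request that asks for a PDF representation."""
    return Request(f"https://doi.org/{quote(doi)}", headers={"Accept": "application/pdf"})


sample_work = {
    "status": "ok",
    "message": {
        "DOI": "10.1234/example",
        "link": [
            {"URL": "https://publisher.example/a.xml", "content-type": "text/xml"},
            {"URL": "https://publisher.example/a.pdf", "content-type": "application/pdf"},
        ],
    },
}
pdf_links = pdf_links_from_crossref(sample_work)
request = build_negotiation_request("10.1234/example")
```

Either route can still end at a paywall, which is why these sources sit behind the open access resolvers.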

DirectURLResolver: Database Knowledge

When our database already contains a PDF URL, this simple resolver validates the URL format and passes it through. No external requests needed.
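
A format check of this kind might look like the following. The function name and the exact acceptance rule (HTTP or HTTPS scheme plus a host) are assumptions for illustration only.

```python
from urllib.parse import urlparse


def is_plausible_pdf_url(url: str) -> bool:
    """Check URL shape only; no network request is made."""
    if not url or not url.strip():
        return False
    try:
        parsed = urlparse(url.strip())
    except ValueError:  # raised for some malformed inputs, e.g. broken IPv6 hosts
        return False
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


checks = {
    candidate: is_plausible_pdf_url(candidate)
    for candidate in ("https://example.org/paper.pdf", "ftp://example.org/paper.pdf", "not a url", "")
}
```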

OpenAthensResolver: Institutional Gateway

For paywalled content, institutional access via OpenAthens provides a final option. This resolver:

Constructs proxy URLs through the institution's authentication system
Integrates with the OpenAthens authentication module
Is assigned the highest priority value (50), which makes it the last source tried and ensures free sources are always preferred
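
Proxy URL construction is essentially string building. The sketch below assumes the OpenAthens redirector link format (`https://go.openathens.net/redirector/<institution domain>?url=<encoded target>`); that format, the function name, and the example domains are assumptions, so confirm the details with your institution's OpenAthens configuration.

```python
from urllib.parse import quote

# Assumed redirector endpoint; not taken from the source document.
REDIRECTOR_BASE = "https://go.openathens.net/redirector"


def build_openathens_url(target_url: str, institution_domain: str) -> str:
    """Wrap a publisher URL so the request passes through institutional login."""
    encoded_target = quote(target_url, safe="")  # encode ':' and '/' as well
    return f"{REDIRECTOR_BASE}/{institution_domain}?url={encoded_target}"


proxied = build_openathens_url("https://publisher.example/a.pdf", "university.example")
```

The wrapped URL only leads to a PDF after the user has authenticated, which is why it is tried last.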

The Two-Phase Download Strategy

Discovery alone doesn't guarantee successful downloads. Web servers employ various protection mechanisms, and PDF delivery varies widely across publishers. Our download strategy addresses this through a two-phase approach:

Phase 1: Direct HTTP Download

For most sources, simple HTTP requests suffice. The system:

Iterates through discovered sources in priority order
Issues GET requests with streaming enabled
Implements exponential backoff retry (2s, 4s, 8s delays)
Validates content type (rejecting HTML login pages)
Verifies PDF magic bytes (%PDF- at file start)
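
The retry schedule and the two validation checks from the list above can be expressed as small pure functions. This is a sketch of the idea, not the system's downloader; the function names and default parameters are hypothetical, chosen so the defaults reproduce the 2s, 4s, 8s schedule.

```python
from typing import List

PDF_MAGIC = b"%PDF-"


def backoff_delays(retries: int = 3, base_seconds: float = 2.0) -> List[float]:
    """Delay before each retry, doubling every time."""
    return [base_seconds * (2 ** attempt) for attempt in range(retries)]


def looks_like_pdf(content_type: str, first_bytes: bytes) -> bool:
    """Reject HTML responses, then require the PDF signature at the start."""
    media_type = (content_type or "").split(";")[0].strip().lower()
    if media_type == "text/html":  # typically a login or error page
        return False
    return first_bytes.startswith(PDF_MAGIC)


delays = backoff_delays()
```

Checking the magic bytes matters because some servers label PDFs as `application/octet-stream`, so the declared content type alone is not a reliable signal.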

Phase 2: Browser Fallback

When HTTP downloads fail—typically due to Cloudflare protection, JavaScript-required pages, or embedded PDF viewers—the system falls back to browser automation via Playwright. This phase:

Launches a headless Chromium instance with anti-detection measures
Handles Cloudflare's "checking your browser" interstitials
Extracts PDFs from embedded viewers
Maintains realistic browser fingerprints

The browser fallback is intentionally positioned as a last resort. It's slower, more resource-intensive, and more fragile. But for the subset of sources that require it, browser automation makes the difference between success and failure.
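
The "last resort" ordering can be shown with the two phases injected as callables, which keeps the sketch independent of any HTTP or browser library. It assumes the HTTP phase is tried for every candidate before the browser phase starts; the essay does not spell out that detail, and all names here are hypothetical.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Fetcher = Callable[[str], Optional[bytes]]  # returns PDF bytes, or None on failure


def download_with_fallback(
    urls: Iterable[str], http_fetch: Fetcher, browser_fetch: Fetcher
) -> Optional[bytes]:
    """Try the cheap phase on all candidates before the expensive one."""
    candidates = list(urls)
    for fetch in (http_fetch, browser_fetch):
        for url in candidates:
            data = fetch(url)
            if data is not None:
                return data
    return None


calls: List[Tuple[str, str]] = []


def fake_http(url: str) -> Optional[bytes]:
    calls.append(("http", url))
    return None  # simulate every direct download being blocked


def fake_browser(url: str) -> Optional[bytes]:
    calls.append(("browser", url))
    return b"%PDF-1.7" if url.endswith("b.pdf") else None


pdf_bytes = download_with_fallback(
    ["https://example.org/a.pdf", "https://example.org/b.pdf"], fake_http, fake_browser
)
```

In the stubbed run, the browser phase only starts after both HTTP attempts have failed.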

Type Safety and Data Modeling

The discovery system employs rigorous type safety through dataclasses and enums:

from enum import Enum


class SourceType(Enum):
    DIRECT_URL = "direct_url"
    DOI_REDIRECT = "doi_redirect"
    PMC = "pmc"
    UNPAYWALL = "unpaywall"
    OPENATHENS = "openathens"
    BROWSER = "browser"
    UNKNOWN = "unknown"


class AccessType(Enum):
    OPEN = "open"
    INSTITUTIONAL = "institutional"
    SUBSCRIPTION = "subscription"
    UNKNOWN = "unknown"

These enums ensure that source types and access requirements are always valid values. A typo in a string literal might propagate silently; an invalid enum value raises an immediate error.
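
A short demonstration of that failure mode, using a trimmed copy of the enum. The `parse_source_type` helper is hypothetical and exists only for this example.

```python
from enum import Enum


class SourceType(Enum):  # trimmed copy of the enum shown above
    PMC = "pmc"
    UNPAYWALL = "unpaywall"
    UNKNOWN = "unknown"


def parse_source_type(raw: str) -> SourceType:
    """Convert stored text to an enum member; unknown values raise ValueError."""
    return SourceType(raw.strip().lower())


parsed = parse_source_type(" PMC ")

try:
    parse_source_type("pmx")  # typo for "pmc"
    typo_rejected = False
except ValueError:
    typo_rejected = True
```

Lookup by value raises `ValueError`, and a misspelled attribute such as `SourceType.PMX` raises `AttributeError`, so either kind of typo surfaces at the point of use.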

Priority-Based Source Selection

Not all PDF sources are created equal. A PMC PDF is essentially guaranteed to work; a publisher link might lead to a paywall. Our priority system encodes this knowledge:

Priority	Source Type	Rationale
1-3	Unpaywall "best"	Curated selection, proven accessible
5-6	PMC	Government repository, always free
10	Direct URL	Known URL, but accessibility varies
15-20	DOI/CrossRef	May be paywalled
50	OpenAthens	Requires authentication, last resort

Lower priority values indicate more desirable sources. The system sorts discovered sources by priority and attempts downloads in order, stopping at the first success.
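
The sort-then-stop behavior is simple to sketch. `PDFSource`, `first_successful`, and the stubbed download function are hypothetical names; the priority values are taken from the table above.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class PDFSource:  # hypothetical minimal source record
    url: str
    priority: int


def first_successful(
    sources: Iterable[PDFSource], try_download: Callable[[PDFSource], bool]
) -> Optional[PDFSource]:
    """Attempt sources in ascending priority order, stopping at the first success."""
    for source in sorted(sources, key=lambda s: s.priority):
        if try_download(source):
            return source
    return None


attempted: List[int] = []


def stub_download(source: PDFSource) -> bool:
    attempted.append(source.priority)
    return source.priority == 15  # pretend only the DOI link works


winner = first_successful(
    [
        PDFSource("https://proxy.example/a.pdf", 50),
        PDFSource("https://pmc.example/a.pdf", 5),
        PDFSource("https://doi.example/a.pdf", 15),
    ],
    stub_download,
)
```

Here the priority-50 source is never attempted, because a more desirable source succeeded first.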

Ethical Considerations

Our system is designed with ethical access in mind:

Legal Channels Only: We query legitimate APIs and respect access controls. The system never circumvents paywalls or authentication requirements.

Rate Limiting: Exponential backoff and reasonable timeouts prevent overwhelming servers.

Terms of Service: Unpaywall requires a contact email; we prominently document this requirement.

Open Access Priority: By prioritizing free sources, we minimize load on subscription services and support the open access ecosystem.

Conclusion

The full-text PDF discovery system in BMLibrarian represents a pragmatic response to the complexity of academic publishing infrastructure. By encapsulating diverse sources as independent resolvers, implementing robust fallback strategies, and maintaining rigorous type safety, we've created a system that reliably delivers PDFs while remaining adaptable to the evolving publishing landscape.

For biomedical researchers, this means more time analyzing literature and less time wrestling with download failures. For developers extending the system, it means clear patterns and well-defined interfaces. And for the broader ecosystem, it means supporting legal access channels and open access initiatives.

The fragments are united; the PDFs flow.

For technical implementation details, see the Developer Manual. For usage instructions, see the User Guide.