kwlch.dev

Type-Safe Pagination in Python with Protocols

2026-04-18T00:00:00Z

I wanted a generic pagination function for a GraphQL API, but the generated types wouldn't let me write one. I was using ariadne-codegen to generate typed Python stubs from a GraphQL schema — the types are correct, but you can't fully control their shape.

Each endpoint returned a discriminated union — a type that could be one of several unrelated types. For one endpoint, that looked like this:

type EndpointAResult = EndpointASuccessItems | EndpointAAuthorizationError | EndpointANotFoundError

@dataclass
class EndpointASuccessItems:
    items: list[EndpointASuccessType]
    page_info: EndpointAPageInfo

The challenge was that each error type was a distinct class with no common base class. Without a shared parent, there's no obvious way to write an isinstance check in a generic function that a type checker can use to narrow the type.

This is the problem structural subtyping solves. Since Python 3.8, Protocols let you define a type by what it has, not what it inherits from. Having spent most of my time writing Go, I find this a familiar concept, very much akin to Go's interfaces.

All the generated error types shared the same fields, but had no common parent:

@dataclass
class EndpointAAuthorizationError:
    error_code: str
    error_message: str

And similarly, every paged response had the same pagination structure:

@dataclass
class EndpointAPageInfo:
    has_next_page: bool
    end_cursor: str | None

So I defined Protocols that captured these shared shapes:

from typing import Protocol, runtime_checkable

@runtime_checkable
class ErrorResponse(Protocol):
    @property
    def error_code(self) -> str: ...
    @property
    def error_message(self) -> str: ...

class PageInfo(Protocol):
    @property
    def has_next_page(self) -> bool: ...
    @property
    def end_cursor(self) -> str | None: ...

class PagedResult[T](Protocol):
    @property
    def page_info(self) -> PageInfo: ...
    @property
    def items(self) -> list[T]: ...

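By default, Protocols are a purely static concept: your type checker understands them, but isinstance doesn't, and calling isinstance against a plain Protocol raises a TypeError. Adding @runtime_checkable lets you use isinstance(result, ErrorResponse) at runtime, which is what makes narrowing possible. The runtime check is shallow, though: it only verifies that the attributes exist, not that they have the declared types.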

With those Protocols in place, the generic pagination function is now possible:

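from collections.abc import Callable, Iterator

# APIError is an application-defined exception; its definition is not shown here.
_DEFAULT_PAGE_SIZE = 100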

def paginate[T](
    fetch: Callable[[int, str | None], PagedResult[T] | ErrorResponse],
    page_size: int = _DEFAULT_PAGE_SIZE,
) -> Iterator[T]:
    after: str | None = None
    while True:
        result = fetch(page_size, after)
        if isinstance(result, ErrorResponse):
            raise APIError(f"Query failed [{result.error_code}]: {result.error_message}")
        yield from result.items
        if not result.page_info.has_next_page:
            break
        after = result.page_info.end_cursor

This handles pagination for any endpoint whose response matches the Protocol, and the type checker correctly narrows result to PagedResult[T] after the isinstance check. Aside from being a static safety check, it also gives us accurate autocompletion in our editor.

Calling it is straightforward:

for item in paginate(lambda first, after: client.get_items(first=first, after=after)):
    print(item)
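If you want to try this without a real API, here's a self-contained sketch that runs paginate against fake data. Everything prefixed Fake, plus DATA, fake_fetch and failing_fetch, is hypothetical and exists only for the demo. APIError is defined here as a plain Exception subclass because its real definition isn't shown above. I've also used TypeVar rather than the 3.12 bracket syntax so the snippet runs on older interpreters; the behaviour is the same.

```python
from collections.abc import Callable, Iterator
from dataclasses import dataclass
from typing import Optional, Protocol, TypeVar, Union, runtime_checkable

T = TypeVar("T")


@runtime_checkable
class ErrorResponse(Protocol):
    @property
    def error_code(self) -> str: ...
    @property
    def error_message(self) -> str: ...


class PageInfo(Protocol):
    @property
    def has_next_page(self) -> bool: ...
    @property
    def end_cursor(self) -> Optional[str]: ...


class PagedResult(Protocol[T]):
    @property
    def page_info(self) -> PageInfo: ...
    @property
    def items(self) -> list[T]: ...


class APIError(Exception):
    """Hypothetical exception type for the demo."""


def paginate(
    fetch: Callable[[int, Optional[str]], Union[PagedResult[T], ErrorResponse]],
    page_size: int = 100,
) -> Iterator[T]:
    after: Optional[str] = None
    while True:
        result = fetch(page_size, after)
        if isinstance(result, ErrorResponse):
            raise APIError(f"Query failed [{result.error_code}]: {result.error_message}")
        yield from result.items
        if not result.page_info.has_next_page:
            break
        after = result.page_info.end_cursor


# Hypothetical stand-ins for generated types: note there is no shared base class.
@dataclass
class FakePageInfo:
    has_next_page: bool
    end_cursor: Optional[str]


@dataclass
class FakeItems:
    items: list[int]
    page_info: FakePageInfo


@dataclass
class FakeAuthError:
    error_code: str
    error_message: str


DATA = list(range(7))


def fake_fetch(first: int, after: Optional[str]) -> Union[FakeItems, FakeAuthError]:
    # The cursor is just the next start index, encoded as a string.
    start = int(after) if after is not None else 0
    end = start + first
    has_next = end < len(DATA)
    return FakeItems(DATA[start:end], FakePageInfo(has_next, str(end) if has_next else None))


def failing_fetch(first: int, after: Optional[str]) -> Union[FakeItems, FakeAuthError]:
    return FakeAuthError("UNAUTHORIZED", "token expired")


# Three pages of up to three items each are flattened into one stream.
all_items = list(paginate(fake_fetch, page_size=3))
```

Passing failing_fetch instead raises APIError on the first page, since FakeAuthError matches ErrorResponse structurally.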

Python's type system still has gaps that are hard to ignore if you've spent time with TypeScript, but Protocols are a useful and, I think, underused feature. Any time you're working with codegen output, third-party libraries, or multiple classes that happen to share a structure, Protocols let you write generic, type-safe code without reaching for inheritance or wrapper classes.

How to keep GitLab CI manageable in a large monorepo

2026-03-27T00:00:00Z

Over the past few years I've worked extensively on a large monorepo hosted on GitLab, and at times the experience has been genuinely painful. Pipelines became a complex web that no human could reason about or change with confidence.

Benoit Couetil's GitLab CI: 10+ Best Practices to Avoid Widespread Anti-Patterns is the best single article I've found on GitLab CI. It shaped a lot of how I think about pipeline design, and I agree with nearly all of it. If you haven't read it, go do that first!

I want to revisit two of his recommendations through the lens of working in a large monorepo. On child pipelines, I've landed in a different place. On abstracting duplicated code, I mostly agree with his point — but I want to push it further and make a case for CI/CD components as the better tool for sharing configuration now that they've matured.

Child pipelines are worth it

Couetil recommends avoiding child pipelines. His concerns — clunky UI, limited artifact sharing, added indirection — were valid when he wrote the article, and some of them still are. But in a monorepo with many services, I think child pipelines are essential.

Imagine a monorepo with several services, each with jobs flowing through build → test → deploy_non_prod → integration_tests → deploy_to_prod. In a single flat pipeline, all of those jobs share stages. If a test fails in one service, every other service is blocked, even if they're completely unrelated.

In the above example, a test failure in one service has stopped the entire pipeline. The other three services built successfully and their tests would pass, but they can't progress because they're stuck behind a failure they have nothing to do with. You've coupled the release of unrelated services to each other's pipeline health.

The `needs` trap

The natural reaction is to reach for needs. Wire up explicit dependencies between jobs so each service's build feeds into its own test, which feeds into its own deploy. Unrelated services can progress independently.

This helps with speed and isolation. But as you add services, the dependency graph grows fast — and with it, the mental overhead of understanding what depends on what.

Every arrow here is a needs relationship. Imagine you're the person adding a new service, or introducing a dependency between two existing ones. You need to understand this entire graph to be confident you haven't missed something. A missed dependency means a deployment could run ahead of a test that should have gated it.

This is the stageless pipeline trap, and I've fallen into it firsthand — we'd traded stage-based coupling for cognitive overload.

Isolation at the pipeline level

What we really want is the simplicity of stages but the isolation of our "needs" graph. With child pipelines, each service (or group of related services) gets its own isolated pipeline with its own stages. A failure in one child pipeline doesn't affect another.

Each child pipeline is small. No one needs to hold a large dependency graph in their head.
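To make the difference concrete, here's a toy model in Python. It isn't GitLab's scheduler, just a sketch of the one rule that matters here: within a pipeline, a stage only starts if every job in the previous stage passed. The service names a to d and the helper names are made up for the illustration.

```python
STAGES = ["build", "test", "deploy_non_prod", "integration_tests", "deploy_to_prod"]


def run_pipeline(services: list[str], failing: set[tuple[str, str]]) -> list[tuple[str, str]]:
    """Run one staged pipeline and return the (service, stage) jobs that passed.

    Every job in a stage runs, but the next stage is skipped for everyone
    as soon as any job in the current stage fails.
    """
    passed: list[tuple[str, str]] = []
    for stage in STAGES:
        jobs = [(svc, stage) for svc in services]
        passed += [job for job in jobs if job not in failing]
        if any(job in failing for job in jobs):
            break  # later stages never start in this pipeline
    return passed


def reached_prod(passed: list[tuple[str, str]]) -> list[str]:
    """Services whose deploy_to_prod job passed."""
    return [svc for svc, stage in passed if stage == "deploy_to_prod"]


services = ["a", "b", "c", "d"]  # hypothetical service names
failing = {("a", "test")}        # service a's tests fail

# Flat: all four services share one set of stages.
flat_prod = reached_prod(run_pipeline(services, failing))

# Child pipelines: each service runs the same stages in its own pipeline.
child_prod = [
    svc
    for service in services
    for svc in reached_prod(run_pipeline([service], failing))
]
```

In the flat case nothing reaches prod. With one pipeline per service, b, c and d deploy and only a is held back.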

The parent pipeline's job is simple: trigger the relevant children based on which files changed. We push most jobs down to the children.

Yes, the UI is still frustrating — waiting for a child pipeline to trigger, not seeing stages inline. It'd be nice if children loaded from the outset when the parent is created. GitLab has improved things, but it's still clunkier than I'd like. It's an inconvenience, but the trade-off for isolation is 100% worth it.

Stop nesting `extends`

In the name of DRY code, some teams end up nesting extends several layers deep, obfuscating what a job is actually doing. You end up grepping across four files, trying to mentally merge YAML that was split apart to avoid repetition. And because extends lets you override any field at any level, there's no way to constrain how someone uses a shared template. People override things they shouldn't, and the resulting behaviour is surprising.

I found a particularly incriminating example from a repo I've worked on:

service_a_deploy_to_prod:
  extends:
    - .deploy_to_prod
  environment:
    name: service_a_prod
  needs:
    - service_a_build
    - service_a_deploy_to_staging
    - job: service_a_integration_test_staging
      optional: true
  variables:
    DOMAIN: services/service_a
    PROFILE: prod
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH && $CI_DEPLOY_FREEZE == null
      changes: !reference [.deploy_service_a_globs, changes]

To understand this job, you need to read .deploy_to_prod (itself a meaningless hop to add a level of indirection):

.deploy_to_prod:
  extends: .deploy_to_prod_base

Which extends .deploy_to_prod_base:

.deploy_to_prod_base:
  extends: .deploy_k8s
  stage: deploy_to_prod
  environment:
    name: prod
  variables:
    ENV: prod
    CLOUD_ROLE_ARN: $CLOUD_ROLE_ARN_PROD
    USE_VPN: "1"
    VPN_ENV: prod

Which extends .deploy_k8s:

.deploy_k8s:
  tags: !reference [.runner, tags]
  image: $CI_REGISTRY_IMAGE/ci-deploy
  id_tokens:
    GITLAB_OIDC_TOKEN:
      aud: https://gitlab.com
  before_script:
    - !reference [.assume_cloud_role, before_script]
    - !reference [.enable_vpn, before_script]
    - if [[ $ENV == "default" ]]; then echo "Env not set"; exit 1; fi;
    - make -C infrastructure/k8s update_kubeconfig/$ENV
  script:
    - REPO_ROOT=$(pwd)
    - cd $DOMAIN
    - ${REPO_ROOT}/ci/shared/apply.sh
    - cd $REPO_ROOT
  variables:
    ENV: default
    GIT_STRATEGY: clone

That's four files. Three levels of inheritance. Any field can be overridden at any level. Good luck reviewing a change to .deploy_k8s and being confident about what it affects.
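This is the mental merge a reviewer has to do, sketched in Python. It's a simplified model of the documented extends behaviour (hashes are merged key by key, anything else, including lists, is replaced by the more specific job), not GitLab's actual implementation. The JOBS dict is a trimmed copy of the chain above with only the mergeable keys kept; deep_merge and resolve are hypothetical helper names.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Merge dicts key by key; any non-dict value (including lists) is replaced."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def resolve(name: str, jobs: dict) -> dict:
    """Flatten a job's extends chain into the config that actually runs."""
    job = jobs[name]
    parents = job.get("extends", [])
    if isinstance(parents, str):
        parents = [parents]
    resolved: dict = {}
    for parent in parents:  # later parents override earlier ones
        resolved = deep_merge(resolved, resolve(parent, jobs))
    own = {key: value for key, value in job.items() if key != "extends"}
    return deep_merge(resolved, own)  # the job's own keys win last


JOBS = {
    ".deploy_k8s": {
        "variables": {"ENV": "default", "GIT_STRATEGY": "clone"},
    },
    ".deploy_to_prod_base": {
        "extends": ".deploy_k8s",
        "stage": "deploy_to_prod",
        "environment": {"name": "prod"},
        "variables": {
            "ENV": "prod",
            "CLOUD_ROLE_ARN": "$CLOUD_ROLE_ARN_PROD",
            "USE_VPN": "1",
            "VPN_ENV": "prod",
        },
    },
    ".deploy_to_prod": {"extends": ".deploy_to_prod_base"},
    "service_a_deploy_to_prod": {
        "extends": [".deploy_to_prod"],
        "environment": {"name": "service_a_prod"},
        "variables": {"DOMAIN": "services/service_a", "PROFILE": "prod"},
    },
}

effective = resolve("service_a_deploy_to_prod", JOBS)
```

The effective job ends up with seven variables contributed by three different definitions, and an environment name that silently replaces the one set two levels up.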

Use CI/CD components instead

CI/CD components solve this more cleanly. A component is a reusable pipeline unit with typed inputs. Instead of inheriting and overriding, you call it with parameters.

Here's the same deployment expressed as a component:

include:
  - component: $CI_SERVER_FQDN/$CI_PROJECT_PATH/kubernetes-deploy@$CI_COMMIT_SHA
    inputs:
      domain: services/service_a
      env: prod
      profile: prod
      cloud_role_arn: $CLOUD_ROLE_ARN_PROD
      vpn_env: prod

And the component itself:

# templates/kubernetes-deploy.yml
spec:
  inputs:
    domain:
      type: string
    env:
      type: string
    profile:
      type: string
      default: ''
    cloud_role_arn:
      type: string
    vpn_env:
      type: string
---
deploy $[[ inputs.env ]] $[[ inputs.domain ]]:
  tags: !reference [.runner, tags]
  image: $CI_REGISTRY_IMAGE/ci-deploy
  stage: deploy_$[[ inputs.env ]]
  id_tokens:
    GITLAB_OIDC_TOKEN:
      aud: https://gitlab.com
  before_script:
    - !reference [.assume_cloud_role, before_script]
    - !reference [.enable_vpn, before_script]
    - if [[ $ENV == "default" ]]; then echo "Env not set"; exit 1; fi;
    - make -C infrastructure/k8s update_kubeconfig/$ENV
  script:
    - REPO_ROOT=$(pwd)
    - cd $DOMAIN
    - ${REPO_ROOT}/ci/shared/apply.sh
    - cd $REPO_ROOT
  variables:
    ENV: $[[ inputs.env ]]
    DOMAIN: $[[ inputs.domain ]]
    PROFILE: $[[ inputs.profile ]]
    CLOUD_ROLE_ARN: $[[ inputs.cloud_role_arn ]]
    GIT_STRATEGY: clone
    USE_VPN: "1"
    VPN_ENV: $[[ inputs.vpn_env ]]

The consumer sees five named inputs. The component author controls what's exposed. Nobody is silently overriding before_script three layers deep in a file you didn't know existed.

Components only went GA in GitLab 17.0, and they're still maturing. But they've already proven to be a much more understandable way to share pipeline configuration than the extends chains they replaced.

Conclusion

None of this is settled wisdom. GitLab keeps shipping changes and improvements — I'm particularly excited to see where Functions go.

If your monorepo has grown to the point where a failure in one service blocks another, or where understanding a single job means mentally reconstructing a chain of extends clauses, these two changes should make a real difference.