ENH: Check whether image is displayed on a given page #3738

Open
andreasntr wants to merge 38 commits into py-pdf:main from andreasntr:main

@andreasntr

@andreasntr andreasntr commented Apr 20, 2026

Addresses #3737

Code brainstormed with Qwen3.5 9B via OpenCode

What was changed

ImageFile now has an is_displayed_on_page(page) method that:

  • checks the page's content stream to determine whether the image is actually displayed on the given page
  • looks for INLINE IMAGE operators for inline images and Do operators for XObject images
  • (removed, as it did not give enough advantage) used caching with pages + bools lists, so repeated checks would be faster

Both inline images and XObject references are supported:

  • inline images: the image name is looked up in the page content
  • image references (XObjects): the image reference is checked against the list of references in the page
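As a rough sketch of the XObject check described above (not the PR's actual code): parsed content-stream operations are (operands, operator) pairs, so a Do lookup reduces to a scan over those pairs. The function name `is_name_drawn` and the sample operation list are assumptions made for illustration.

```python
from typing import Any, List, Sequence, Tuple

# Each entry mirrors the (operands, operator) pairs of a parsed content stream.
Operation = Tuple[Sequence[Any], bytes]


def is_name_drawn(name: str, operations: List[Operation]) -> bool:
    """Return True if a ``Do`` operator paints the XObject called ``name``."""
    for operands, operator in operations:
        if operator == b"Do" and operands and operands[0] == name:
            return True
    return False


# Hypothetical page: /Im0 is painted, /Im1 is only listed in the resources.
sample_operations: List[Operation] = [
    ([], b"q"),
    (["/Im0"], b"Do"),
    ([], b"Q"),
]

im0_drawn = is_name_drawn("/Im0", sample_operations)
im1_drawn = is_name_drawn("/Im1", sample_operations)
```

In this sketch `/Im1` is reported as not drawn, which is the "referenced but not displayed" case the PR is about.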

Backward compatibility

Checks are performed lazily, so that if a user is not interested in the feature, there is no overhead while reading the PDF or interacting with it.

@andreasntr andreasntr changed the title Check whether image is displayed on a given page DEV: Check whether image is displayed on a given page Apr 20, 2026
@andreasntr andreasntr changed the title DEV: Check whether image is displayed on a given page ENH: Check whether image is displayed on a given page Apr 20, 2026
@codecov

codecov Bot commented Apr 20, 2026

Codecov Report

❌ Patch coverage is 87.83784% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 97.62%. Comparing base (e044789) to head (bb11c8c).
⚠️ Report is 4 commits behind head on main.

Files with missing lines | Patch % | Lines
pypdf/_page.py | 87.83% | 5 Missing and 4 partials ⚠️

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3738      +/-   ##
==========================================
- Coverage   97.66%   97.62%   -0.05%     
==========================================
  Files          55       55              
  Lines       10291    10391     +100     
  Branches     1890     1920      +30     
==========================================
+ Hits        10051    10144      +93     
- Misses        135      138       +3     
- Partials      105      109       +4     

@andreasntr
Author

andreasntr commented Apr 20, 2026

@stefan6419846 also added a new test, waiting for #37 to be merged to restart CI jobs (tested locally with pytest though)

Collaborator

@stefan6419846 stefan6419846 left a comment

As mentioned in the issue already, I think we should determine this value directly and save it into the ImageFile instead of doing the heavy lifting of possibly parsing all content streams for each image.

This means: When I get the ImageFile through PageObject.images, just accessing ImageFile.is_displayed should expose the desired value.

If I remember correctly, retrieving the inline images already parses the content stream of the page, thus hooking into it and populating the display values here should be possible without too much overhead.

Do you think this is possible?

@andreasntr
Author

This means shifting this logic to the pdf/page constructor I think, which implies that even users not interested in this feature would be impacted by the possible overhead (?)

@stefan6419846
Collaborator

I would hope that a proper approach would not introduce way more overhead than we already require for parsing inline images from the content stream.

@andreasntr
Author

andreasntr commented Apr 22, 2026

@stefan6419846 switched to a bool field in the ImageFile class. The flag is set at object creation time:

  • inline images: is_displayed is always True since they don't have references and can not be duplicated
  • reference images (via the images property of each page): the page content is checked to determine whether the reference is actually used

Failing tests have been fixed as well

@stefan6419846
Collaborator

switched to a bool field in the ImageFile class. The flag is set at object creation time

Besides now saving it inside a dedicated attribute, we still parse the content stream for each image, thus it produces overhead.

Instead, I would propose to generalize the logic inside _get_inline_images to populate regular image data as well.

@andreasntr
Author

I would propose to generalize the logic inside _get_inline_images to populate regular image data as well.

How does this remove overhead? We still need to parse each image to compute whether it is displayed or not

@stefan6419846
Collaborator

We do not need to create a ContentStream and let it generate the operations for each image, but can do this once at the beginning if I am not mistaken.
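The overhead argument can be illustrated with a small sketch that counts how often the expensive parsing step runs. Everything here is hypothetical: `parse_operations` merely stands in for creating a ContentStream and generating its operations, and the sample data is made up.

```python
from typing import Any, Dict, List, Sequence, Set, Tuple

Operation = Tuple[Sequence[Any], bytes]

stats: Dict[str, int] = {"parses": 0}  # counts runs of the expensive step


def parse_operations(raw: List[Operation]) -> List[Operation]:
    """Stand-in for creating a ContentStream and generating its operations."""
    stats["parses"] += 1
    return list(raw)


def displayed_per_image(names: List[str], raw: List[Operation]) -> Set[str]:
    """Parse the stream again for every image name."""
    shown: Set[str] = set()
    for name in names:
        for operands, operator in parse_operations(raw):
            if operator == b"Do" and operands and operands[0] == name:
                shown.add(name)
    return shown


def displayed_single_pass(names: List[str], raw: List[Operation]) -> Set[str]:
    """Parse the stream once and answer all lookups from one set."""
    drawn = {
        operands[0]
        for operands, operator in parse_operations(raw)
        if operator == b"Do" and operands
    }
    return drawn & set(names)


raw_stream: List[Operation] = [(["/Im0"], b"Do"), (["/Im2"], b"Do")]
image_names = ["/Im0", "/Im1", "/Im2"]

per_image_result = displayed_per_image(image_names, raw_stream)
parses_per_image = stats["parses"]

stats["parses"] = 0
single_pass_result = displayed_single_pass(image_names, raw_stream)
parses_single_pass = stats["parses"]
```

Both functions return the same set, but the per-image variant parses once per image name while the single-pass variant parses exactly once.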

@andreasntr
Author

We do not need to create a ContentStream and let it generate the operations for each image, but can do this once at the beginning if I am not mistaken.

Do you mean at document level then? I would need a bit of high-level guidance on this if possible

@stefan6419846
Collaborator

Do you mean at document level then?

ImageFile instances are generated on the page-level, thus pypdf can have a simple property on it to indicate whether it is displayed on the current page or not.

I would need a bit of high-level guidance on this if possible

_get_inline_images should be renamed to make it obvious that it parses the content stream of the page for image-related data. Currently, it only populates inline images. The goal is to extend its operator handling in

pypdf/pypdf/_page.py

Lines 735 to 744 in 64a793b

    for param, ope in content.operations:
        if ope == b"INLINE IMAGE":
            imgs_data.append(
                {"settings": param["settings"], "__streamdata__": param["data"]}
            )
        elif ope in (b"BI", b"EI", b"ID"):  # pragma: no cover
            raise PdfReadError(
                f"{ope!r} operator met whereas not expected, "
                "please share use case with pypdf dev team"
            )
to look for Do operators and record the corresponding identifiers (similar to your current approach). These have to be mapped to their respective ImageFile instances afterwards by a suitable approach.
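One possible shape of the extended loop, as a standalone sketch rather than the change that ended up in the PR: the function name `collect_image_data` and the sample operations are assumptions, while the `INLINE IMAGE` branch follows the quoted snippet.

```python
from typing import Any, Dict, List, Tuple

Operation = Tuple[Any, bytes]


def collect_image_data(
    operations: List[Operation],
) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Gather inline image payloads and the names painted by ``Do`` in one pass."""
    inline_data: List[Dict[str, Any]] = []
    drawn_names: List[str] = []
    for param, ope in operations:
        if ope == b"INLINE IMAGE":
            inline_data.append(
                {"settings": param["settings"], "__streamdata__": param["data"]}
            )
        elif ope == b"Do":
            # The single operand of Do is the XObject name, e.g. "/Im0".
            drawn_names.append(param[0])
    return inline_data, drawn_names


# Hypothetical operations: one inline image, one image XObject, one form XObject.
example_operations: List[Operation] = [
    ({"settings": {"/W": 1, "/H": 1}, "data": b"\x00"}, b"INLINE IMAGE"),
    (["/Im0"], b"Do"),
    (["/Fm0"], b"Do"),
]
inline_found, names_found = collect_image_data(example_operations)
```

Since Do paints any kind of XObject, including forms, the recorded names would still need to be matched against the page's image XObjects before they are mapped to ImageFile instances.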

@andreasntr
Author

This is what I came up with, please confirm before I commit to avoid useless commits.

is_displayed is a boolean property set at ImageFile creation time (not per-call method).

Content stream parsing (_parse_content_stream() -> former _get_inline_images()):

  • Scans page /Contents for BI/EI operators (inline) and Do operators (Do-referenced)
  • Extracts image data and creates ImageFile instances
  • Sets is_displayed=True for both inline and Do-referenced (as they're in the stream)
  • Stores cached ImageFile dict in self.inline_images for later reuse by _get_image() and _get_ids_image()

@stefan6419846
Collaborator

_parse_content_stream does not indicate that we are dealing with images, which we indeed are. Thus including the image part in the name would help with future maintenance.

Additionally, please note that _get_inline_images currently only returns ImageFile instances for inline images. For "regular" images, the value needs to be propagated to the constructors for "regular" ImageFile instances accordingly.

@andreasntr
Author

A property or attribute displayed_images should not be required, as this can be determined from the new ImageFile instances.

This is not true because inline_images only holds inline images (I'm referring to the current main now), while displayed_images also manages xobject/do references.

I understand this. Thus I propose to either have

  • a list of ALL images maintained internally or
  • a list of all inline images maintained internally as well as a list of displayed image names to be able to create ImageFile instances accordingly.

Option 1 is what I implemented here, the issue lies in the name because:

  • we cannot deprecate inline_images
  • we can't have inline_images hold non-inline images because it would disrupt tests and functionalities

So we are in a dead spot

@stefan6419846
Collaborator

We do not have to rely on the inline_images attribute internally - this is just some arbitrary design decision chosen during the initial implementation.

@andreasntr
Author

We do not have to rely on the inline_images attribute internally - this is just some arbitrary design decision chosen during the initial implementation.

In theory we could simply do this:

  • restore inline_images
  • keep displayed_images (works as a cache just like inline_images) and let this be returned by images

In practice there is the caveat that images currently returns ALL images even if not displayed or not inline (through _get_ids_image) or just inline images based on whether inline_images is null, giving us the following alternatives:

  • have images rely on displayed_images directly, which breaks the current logic (returning only inline images if inline_images is not null or returning also referenced-but-not-displayed images) but allows easier filtering (by the user) on need
  • images can still be allowed to return only inline images in certain cases, but this invalidates the whole point of having Image.is_displayed for filtering on need

@stefan6419846
Collaborator

In practice there is the caveat that images currently returns ALL images even if not displayed or not inline (through _get_ids_image) or just inline images based on whether inline_images is null

inline_images is None by default and populated on the first usage. The fact that it has been publicly writable the whole time is a design flaw in my opinion.

My goal would be:

  • Have images always return all images referenced by the page.
  • Populate ImageFile.is_inline and ImageFile.is_displayed properly.
  • Make attributes required for caching purposes (due to having to read the whole content stream) internal.
  • Introduce a deprecation period for inline_images:
    • Make it a property and emulate the current behavior with the new approach.
    • Make the property setter emulate the current behavior which indirectly allowed to control the caching mechanism.

This ensures that there is exactly one public API for retrieving the images and with the object-oriented approach we are able to provide the user the most control without polluting the page API unnecessarily.
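A minimal sketch of the deprecation idea, assuming the standard `warnings` module rather than whatever deprecation helpers pypdf itself would use. `PageSketch` is a hypothetical stand-in, not pypdf's PageObject, and the warning messages are invented.

```python
import warnings
from typing import Dict, Optional


class PageSketch:
    """Hypothetical stand-in for a page; not pypdf's PageObject."""

    def __init__(self) -> None:
        # Internal cache, None until the content stream has been read.
        self._inline_images: Optional[Dict[str, bytes]] = None

    @property
    def inline_images(self) -> Optional[Dict[str, bytes]]:
        warnings.warn(
            "inline_images is deprecated; filter images by is_inline instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self._inline_images

    @inline_images.setter
    def inline_images(self, value: Optional[Dict[str, bytes]]) -> None:
        warnings.warn(
            "setting inline_images is deprecated",
            DeprecationWarning,
            stacklevel=2,
        )
        # Assigning a value (or None) still controls the internal cache.
        self._inline_images = value
```

Old code that reads or assigns `page.inline_images` keeps working during the deprecation period, while the state itself lives in an internal attribute.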

@andreasntr
Author

Makes sense, I'll work on this next week. Thank you!

@andreasntr
Author

andreasntr commented May 16, 2026

@stefan6419846 Updated as required. There are a couple of tests failing locally:

  • a file in tests/test_reader.py::test_read_form_416 cannot be downloaded (unknown error handling), causing an ExceptionGroup: multiple unraisable exception warnings (3 sub-exceptions)
  • a file in tests/test_filters.py::test_index_lookup gives an OSError: broken data stream when reading image file
  • tests/test_filters.py::test_jpx_no_spacecode: I don't understand how colorspace management can be influenced by the current edits. Probably this should be rewritten/deprecated?
  • tests/test_images.py::test_separation_1byte_to_rgb_inverted: I tried looking at the mentioned PR but I couldn't understand why this should be raising a ValueError, can you help?

Please note that I also opened a PR for sample-files to add a test file for _displayed_images (actions emit a style error but it doesn't seem related to my addition)

Comment thread pypdf/_page.py
lst.extend(list(self.inline_images.keys()))
return lst

# Removes duplicates and preserves order
Collaborator

Why do we need this? How has this been handled before?

Author

@andreasntr andreasntr May 18, 2026

Previously this method returned only inline images. Some tests relying on this method (for example when accessing images) require inline images at the beginning of the list.

EDIT: investigating more, I forgot the reason for the reordering

Collaborator

IMHO relying on this behavior is wrong and we might need to improve the tests in this regard.

Author

Updated response
Ordering is used to return images in the same order as they appear in the page.
As for deduplication, XObjects can potentially be referenced by Do operators multiple times: in this case we only return one reference, as the cache is implemented as a map between image name and image value
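For reference, order-preserving deduplication of names can be written compactly in Python. This is a generic sketch, not the code under review; `unique_in_order` and the sample names are made up.

```python
from typing import Iterable, List


def unique_in_order(names: Iterable[str]) -> List[str]:
    """Drop repeated names while keeping the first occurrence of each."""
    # dict preserves insertion order (guaranteed since Python 3.7).
    return list(dict.fromkeys(names))


# Hypothetical names in the order their Do operators appear on a page.
drawn = ["/Im1", "/Im0", "/Im1", "/Im2", "/Im0"]
unique_drawn = unique_in_order(drawn)
```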

Author

Do you have any other questions on this?

Collaborator

I guess this part becomes obsolete anyway, as our goal is to not load regular images here any more and for inline images, the ordering should not change with the old and new implementation?

Author

Yes, I'll report back on this once implemented

Comment thread pypdf/_page.py
)
# Process Do-referenced images first
files = {}
xobjs: Optional[DictionaryObject] = None
Collaborator

Why do we need this processing here? Couldn't we just record the names and retrieve this data when accessing the image file only?

Author

This way we would be populating the cache only for a subset of image types, excluding XObjects

Author

If we want to process images lazily, this is what I would do:

  • set the type of _content_stream_images to Optional[dict[str, Optional[ImageFile]]]
  • when reading the page:
    • set all the keys immediately to understand which images are displayed and which are inline
    • set inline images value immediately as they are not expensive to process
    • keep any other image value to None
  • when accessing an image:
    • if cached (either inline or previously accessed), return the cached value
    • if value is None (xobjects), process the page content, retrieve the image content and cache it
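The lazy scheme outlined in the list above could look roughly like this. It is a sketch under assumptions: `LazyImageCache`, its `loader` callback, and the byte payloads are hypothetical, and real values would be ImageFile instances rather than bytes.

```python
from typing import Callable, Dict, List, Optional


class LazyImageCache:
    """Hypothetical cache: keys are known up front, heavy values load on demand."""

    def __init__(
        self,
        inline: Dict[str, bytes],
        xobject_names: List[str],
        loader: Callable[[str], bytes],
    ) -> None:
        self._loader = loader
        # Inline images are cheap, so their values are stored immediately.
        self._entries: Dict[str, Optional[bytes]] = dict(inline)
        # XObject images only get a None placeholder until first access.
        for name in xobject_names:
            self._entries.setdefault(name, None)

    def get(self, name: str) -> bytes:
        value = self._entries[name]  # raises KeyError for unknown names
        if value is None:
            value = self._loader(name)
            self._entries[name] = value
        return value


loads: List[str] = []  # records which names were actually loaded


def fake_loader(name: str) -> bytes:
    loads.append(name)
    return b"data-for-" + name.encode()


cache = LazyImageCache({"~0~": b"inline"}, ["/Im0"], fake_loader)
first = cache.get("/Im0")
second = cache.get("/Im0")
```

The loader runs only on the first access to `/Im0`; the second call is served from the cache.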

Collaborator

This sounds like the better approach, although I am unsure if we really should introduce caching for regular image files which could be very large.

I probably would have used two internal attributes (_inline_images and _displayed_name) or relied on functools.cached_property instead of implementing our own caching directly.
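A sketch of the `functools.cached_property` alternative mentioned above, assuming Python 3.8 or newer. The class `PageWithCachedNames`, its `scan_count` counter, and the sample operations are hypothetical.

```python
from functools import cached_property
from typing import Any, FrozenSet, List, Sequence, Tuple

Operation = Tuple[Sequence[Any], bytes]


class PageWithCachedNames:
    """Hypothetical page wrapper; ``operations`` stands in for the parsed stream."""

    def __init__(self, operations: List[Operation]) -> None:
        self._operations = operations
        self.scan_count = 0  # only here to make the caching observable

    @cached_property
    def _displayed_names(self) -> FrozenSet[str]:
        # Runs on first access only; the result is then stored on the instance.
        self.scan_count += 1
        return frozenset(
            operands[0]
            for operands, operator in self._operations
            if operator == b"Do" and operands
        )


cached_page = PageWithCachedNames([(["/Im0"], b"Do")])
names_first = cached_page._displayed_names
names_again = cached_page._displayed_names
```

The second access returns the stored object without scanning again, so no hand-written cache bookkeeping is needed.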

Author

How would those be used by the user? I mean currently the user can rely on the displayed and inline flags. As for the cache, I can investigate

Collaborator

My preference would still be to deprecate inline_images completely if we are able to and instead let the user use ImageFile.is_inline and ImageFile.is_displayed - IMHO this is the cleanest API. (One improvement would be to maybe make ImageFile a frozen dataclass to make the values final.)

If you think that your other branch is the way to go, please consider marking this PR as draft and open a new PR for the new changes to allow for a clean review process.

Author

Honestly I'm not a software expert (I'm a computer scientist, I can code, but I work in the ML field, so much of the SWE stuff is obscure to me), so I didn't know about frozen dataclasses etc.: your guidance is precious here. Also, this is your repo after all, so I will adapt to what you feel is more polished for production. My other branch is based on keeping inline_images and removing extra ImageFile attributes, so it wouldn't suit your current requirement.

As I said earlier, I'd prefer that you pick a side here, and I'll follow with the implementation

Collaborator

Sorry for not being specific here. In terms of clean design, having both images and inline_images available on the page sounds unnecessary if we can have a cleaner class-based approach by making is_inline a property of the ImageFile and only providing images as the API endpoint for retrieving page images.

As I said, my preference is the following:

  • Have the two new boolean attributes/properties on the ImageFile.
  • Parse the content stream of the page only once.
  • Store state data in an internal variable.
  • Only cache inline images.
  • Deprecate inline_images in favor of ImageFile.is_inline.

I will leave it up to you which of your current branches serves as the best starting point for resolving this.

Please indicate if some aspects remain unclear for you, so we can have a look at it. Otherwise, please request a review from my side as soon as you think that it makes sense to get this merged in this state.

I didn't know about frozen dataclasses etc

No worries. I have just seen that making it frozen will not work anyway due to how replace_image is implemented. Just ignore this part here.

Author

Ok, thanks for the clarification.

  • Parse the content stream of the page only once.
  • Only cache inline images.

These two seem conflicting. I understand that inline images can be cached in the current internal property but what about the content stream? Do you mean that it must be parsed once just to get the names of the actually displayed images? We settled on not caching other images since they may be very large, so for those we would still process the content stream every time a user wants to get the image content, am I getting it right or do we want to migrate back to full caching (lazy however)?

Collaborator

Yes, just get the names of actually displayed images from the content stream and cache these values. All images which are not inline have their own stream object and do not need actual parsing of the stream internals, thus we can avoid caching their actual contents.
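Put together, the agreed direction might be sketched as follows: only the set of displayed names (plus the inline images) comes from the content stream, and each image record gets its flags from that set. `ImageInfo` is a simplified, hypothetical stand-in for ImageFile, and `"~0~"` is just a placeholder name for an inline image.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class ImageInfo:
    """Hypothetical, simplified stand-in for an ImageFile."""

    name: str
    is_inline: bool
    is_displayed: bool


def build_images(
    resource_names: List[str],
    inline_names: List[str],
    displayed_names: Set[str],
) -> List[ImageInfo]:
    """Combine resource entries with the names recorded from the content stream."""
    images = [
        ImageInfo(name, is_inline=False, is_displayed=name in displayed_names)
        for name in resource_names
    ]
    # Inline images exist only inside the content stream, so they count as displayed.
    images.extend(
        ImageInfo(name, is_inline=True, is_displayed=True) for name in inline_names
    )
    return images


infos = build_images(["/Im0", "/Im1"], ["~0~"], {"/Im0"})
visible = [info.name for info in infos if info.is_displayed]
```

A user who only wants what is actually painted on the page can then filter on `is_displayed`, as in the last line.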

@stefan6419846
Collaborator

Updated as required. There are a couple of tests failing locally:

  • a file in tests/test_reader.py::test_read_form_416 cannot be downloaded (unknown error handling), causing an ExceptionGroup: multiple unraisable exception warnings (3 sub-exceptions)

I guess this is just a flaky test.

  • a file in tests/test_filters.py::test_index_lookup gives an OSError: broken data stream when reading image file
  • tests/test_filters.py::test_jpx_no_spacecode: I don't understand how colorspace management can be influenced by the current edits. Probably this should be rewritten/deprecated?

I guess these are failing due to your rewrite, although I have not yet looked further into why. They should keep working without changing the tests themselves.

  • tests/test_images.py::test_separation_1byte_to_rgb_inverted: I tried looking at the mentioned PR but I couldn't understand why this should be raising a ValueError, can you help?

Running this on a system with the main code gives ValueError: not enough image data. The new failure is most likely related to your changes as in the previous cases.

Please note that I also opened a PR for sample-files to add a test file for _displayed_images (actions emit a style error but it doesn't seem related to my addition)

We should update our mypy configuration to exclude the sample files repository.

@andreasntr
Author

  • a file in tests/test_filters.py::test_index_lookup gives an OSError: broken data stream when reading image file
  • tests/test_filters.py::test_jpx_no_spacecode: I don't understand how colorspace management can be influenced by the current edits. Probably this should be rewritten/deprecated?

I guess these are failing due to your rewrite, although I have not yet looked further into why. They should keep working without changing the tests themselves.

  • tests/test_images.py::test_separation_1byte_to_rgb_inverted: I tried looking at the mentioned PR but I couldn't understand why this should be raising a ValueError, can you help?

Running this on a system with the main code gives ValueError: not enough image data. The new failure is most likely related to your changes as in the previous cases.

Fixed both: the cache wasn't being invalidated after the images were manipulated.
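The kind of bug described here can be reproduced in miniature. This sketch is generic and hypothetical (`ImageStore` and its methods are not pypdf code): a derived cache must be reset whenever the underlying image data changes.

```python
from typing import Dict, Optional


class ImageStore:
    """Hypothetical holder of raw image bytes plus a derived, cached view."""

    def __init__(self, raw: Dict[str, bytes]) -> None:
        self._raw = dict(raw)
        self._sizes: Optional[Dict[str, int]] = None  # derived cache

    def sizes(self) -> Dict[str, int]:
        if self._sizes is None:
            self._sizes = {name: len(data) for name, data in self._raw.items()}
        return self._sizes

    def replace(self, name: str, data: bytes) -> None:
        self._raw[name] = data
        # Without this reset, sizes() would keep returning stale values.
        self._sizes = None


store = ImageStore({"/Im0": b"abc"})
size_before = store.sizes()["/Im0"]
store.replace("/Im0", b"abcdef")
size_after = store.sizes()["/Im0"]
```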

@andreasntr
Author

  • a file in tests/test_filters.py::test_index_lookup gives an OSError: broken data stream when reading image file
  • tests/test_filters.py::test_jpx_no_spacecode: I don't understand how colorspace management can be influenced by the current edits. Probably this should be rewritten/deprecated?

I guess these are failing due to your rewrite, although I have not yet looked further into why. They should keep working without changing the tests themselves.

Fixed by logging a warning instead of raising an exception as done in generic._image_xobject._xobj_to_image (it was /Im0 that had errors; it was not spotted previously because of the lazy loading)
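The general pattern of warning instead of raising can be sketched with the standard `logging` module. This is an illustration only: `decode_all`, `flaky_decode`, and the logger name are hypothetical, and pypdf's own logging helpers may differ.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger("image_sketch")


def decode_all(
    names: List[str], decode: Callable[[str], bytes]
) -> Dict[str, bytes]:
    """Decode every image, logging failures instead of aborting the whole loop."""
    decoded: Dict[str, bytes] = {}
    for name in names:
        try:
            decoded[name] = decode(name)
        except (OSError, ValueError) as exc:
            logger.warning("Could not decode image %s: %s", name, exc)
    return decoded


def flaky_decode(name: str) -> bytes:
    # Hypothetical decoder: /Im0 is broken, everything else decodes.
    if name == "/Im0":
        raise OSError("broken data stream when reading image file")
    return b"ok"


result = decode_all(["/Im0", "/Im1"], flaky_decode)
```

With this approach one broken image no longer prevents the remaining images on the page from being returned.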
