[Data Liberation] WP_Stream_Importer: User-driven incremental import #2013
Merged
Exploratory PR to keep track of the import state so that, upon crash, the next run may seamlessly resume where the previous one left off.
…dest entity whose downloads were finalized
…()/seek() methods
91 tasks
stage. Identify downloaded resources by their URL.
…db; Add UI to browse it
zaerl (Collaborator) approved these changes on Nov 28, 2024 and left a comment:
An excellent step forward, Adam, I like it. Using custom post types is a good idea. I am OK with merging this; I just left a couple of comments.
        break;
    }

    $post_id = wp_insert_post(
Collaborator
Using custom post types is a great idea. 👍
Collaborator, Author
Thank you! I was looking for a way to reuse as much of what we already have as possible. A custom table crossed my mind, and we still might need one for the vector clock, but for managing metadata, post types and meta seem perfect.
Collaborator
Yes. In the near future, we will probably need a place to store binary and similar data. But for now, using custom post types for this is perfectly fine.
Adds wp-admin support for incrementally importing data from WXR files.
This is a part of #1894.
Implementation details
There can be only one active import session at any given time. It is started by uploading a WXR file or specifying a URL, and can be extended to any number of data sources. Once created, the admin page shows the current import progress. This PR adds a WP_Import_Session model class to store the progress information and the current import cursor.

Given an active import session, the admin page shows the current stage and the number of imported entities, accompanied by a "Continue Importing" button. When pressed, it calls WP_Stream_Importer::next_step() one or more times to perform a small unit of work. After each call, we collect the progress information from WP_Stream_Importer: the number of downloaded asset bytes, the number of inserted database records, the current importing cursor, etc.

next_step() returns true when some progress was made, even if that was a failed image download attempt. It returns false when it reaches the end of the current importing stage, at which point the advance_to_next_stage() method must be called.

After each next_step() or advance_to_next_stage() call, WP_Stream_Importer::get_reentrancy_cursor() returns a string that can be used to create a new importer that will resume from the exact same place. The cursor means "we got this far", not "we got this far and no further": the record the cursor points to may have already been processed. In the upcoming PRs we'll need to either point to the next entity, or invent idempotent import semantics where processing the same record twice leads to the same outcome as processing it once.

Resource Budgets
This PR starts exploring resource budgets by introducing a soft time limit and a minimum number of files downloaded during a single frontloading session. We don't support partial download and resuming yet, so we can't settle for downloading less than one file. On the next attempt we'd just discard the result and likely download less than one file again, meaning we would never get past the frontloading step.
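The interplay between the step loop, the cursor, and the budget can be illustrated with a small sketch. It is written in Python for brevity and is not the PR's PHP code. Only next_step(), advance_to_next_stage(), and get_reentrancy_cursor() are names taken from this PR. FakeImporter, is_finished(), the stage names, the "stage:steps" cursor format, run_session(), and import_everything() are all hypothetical stand-ins, and min_steps loosely models the "minimum number of files" rule as a minimum number of work units.

```python
import time
from typing import Optional


class FakeImporter:
    """Hypothetical stand-in for WP_Stream_Importer (not the real class)."""

    STAGES = ["frontload_assets", "import_entities"]  # hypothetical stage names

    def __init__(self, work_per_stage: int, cursor: Optional[str] = None):
        self.work_per_stage = work_per_stage
        # Hypothetical cursor format "stage_index:steps_done".
        # The real cursor is an opaque string whose format is not shown in the PR.
        stage, done = (cursor or "0:0").split(":")
        self.stage, self.done = int(stage), int(done)

    def is_finished(self) -> bool:  # hypothetical helper
        return self.stage >= len(self.STAGES)

    def next_step(self) -> bool:
        """Do one unit of work; return False at the end of the current stage."""
        if self.is_finished() or self.done >= self.work_per_stage:
            return False
        self.done += 1
        return True

    def advance_to_next_stage(self) -> None:
        if not self.is_finished():
            self.stage += 1
            self.done = 0

    def get_reentrancy_cursor(self) -> str:
        return f"{self.stage}:{self.done}"


def run_session(importer: FakeImporter, soft_time_limit: float, min_steps: int = 1) -> str:
    """Run one budgeted session and return the cursor to persist for the next one."""
    started = time.monotonic()
    steps = 0
    while not importer.is_finished():
        # The limit is soft: it is honored only after min_steps units of work,
        # so even a tiny budget always makes some progress.
        if steps >= min_steps and time.monotonic() - started >= soft_time_limit:
            break
        if importer.next_step():
            steps += 1
        else:
            importer.advance_to_next_stage()
    return importer.get_reentrancy_cursor()


def import_everything(work_per_stage: int, soft_time_limit: float) -> int:
    """Simulate repeated "Continue Importing" clicks; return the session count."""
    cursor = None
    sessions = 0
    while True:
        # Each click builds a fresh importer from the stored cursor.
        importer = FakeImporter(work_per_stage, cursor)
        if importer.is_finished():
            return sessions
        cursor = run_session(importer, soft_time_limit)
        sessions += 1
```

With a zero time budget each session still completes one unit of work, which is the property the minimum is there to guarantee: without it, a session could end before finishing a single file and the import would never leave the frontloading stage.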
Testing instructions
cd packages/playground/data-liberation/tests/import
bash run.sh
packages/playground/data-liberation/tests/wxr/a11y-unit-test-data.xml