PDF → PowerPoint — Reconstructing text boxes

Word treats text as a continuous flow that fills the page. PowerPoint treats text as a collection of fixed-size containers (text boxes), each with its own position, size, and styling. The PPT side of conversion has to rebuild the containers from scratch out of a flat sequence of text runs.

What a text box actually is

Every block of text on a slide is a shape, either a generic text box or a placeholder bound to a layout. Each one carries:

Position and size. Fixed. Text does not flow past the borders.
Text. Paragraphs and runs inside the shape.
Formatting. Font, size, color, alignment, padding.
Auto-fit behavior. What happens on overflow.

Word paragraphs grow to hold whatever text they contain. Text boxes do not stretch by default. When the text is too long, it either gets clipped or the font shrinks.

From parsed PDF to grouped paragraphs

The converter starts with the same text runs and coordinates the Word pipeline uses. Grouping proceeds in three stages: runs into lines (by Y proximity), lines into paragraphs (by proximity and shared indentation), and paragraphs into text blocks. The third stage is the new one.

The clustering rule is visual. Paragraphs near each other belong together; paragraphs separated by a large gap or by a non-text object (an image, a graphic) belong to different boxes; paragraphs sharing font, size, and color usually belong together. A practical threshold: split into separate text boxes when the gap between two paragraphs exceeds 3× the line height, or when any non-text object falls between them.
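The threshold rule can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the Paragraph and Obstacle types and the split_into_blocks function are hypothetical names, the page is assumed to be a single column with y growing downward, and the font/size/color similarity check is left out.

```python
from dataclasses import dataclass

@dataclass
class Obstacle:
    top: float      # vertical extent of a non-text object, in points
    bottom: float

@dataclass
class Paragraph:
    top: float          # y grows downward in this sketch
    bottom: float
    line_height: float

def split_into_blocks(paragraphs, obstacles=(), gap_factor=3.0):
    """Group paragraphs into text-box candidates, top to bottom."""
    blocks = []
    for para in sorted(paragraphs, key=lambda p: p.top):
        if blocks:
            prev = blocks[-1][-1]
            gap = para.top - prev.bottom
            # An obstacle "falls between" when it overlaps the vertical gap.
            blocked = any(o.top < para.top and o.bottom > prev.bottom
                          for o in obstacles)
            if gap <= gap_factor * prev.line_height and not blocked:
                blocks[-1].append(para)
                continue
        blocks.append([para])  # start a new text box
    return blocks
```

A real page would also need the horizontal position of each obstacle, since an image beside a column should not split it.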

Sizing and positioning the box

The bounding rectangle comes straight from the contained paragraphs:

Left edge. Minimum X across all paragraphs (with a small margin).
Right edge. Maximum X (with a margin).
Top edge. Y of the topmost paragraph.
Bottom edge. Y of the bottom paragraph plus the height of its last line.

The box has to be slightly larger than the content. PowerPoint and PDF render the same font with subtly different metrics (kerning, line height), and a tight box can clip the last few characters of a line. Adding 5–10% padding on width and height absorbs the difference.
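As a rough sketch of the arithmetic, the following Python computes the union rectangle, pads it, and converts to EMU (12,700 per point). The Box type and the to_xfrm function are hypothetical; coordinates are assumed to be in points with the origin already flipped to the top-left corner, and the padding is added on the right and bottom so the text keeps its original top-left anchor.

```python
from dataclasses import dataclass

EMU_PER_POINT = 12700  # 914400 EMU per inch / 72 points per inch

@dataclass
class Box:
    left: float
    top: float
    right: float
    bottom: float

def to_xfrm(paragraph_boxes, pad=0.05):
    """Return (x, y, cx, cy) in EMU for the padded union of the boxes."""
    left = min(b.left for b in paragraph_boxes)
    top = min(b.top for b in paragraph_boxes)
    right = max(b.right for b in paragraph_boxes)
    bottom = max(b.bottom for b in paragraph_boxes)
    width = (right - left) * (1 + pad)    # grow to absorb metric differences
    height = (bottom - top) * (1 + pad)
    return tuple(round(v * EMU_PER_POINT) for v in (left, top, width, height))
```

The pad value is a tuning knob in the 0.05 to 0.10 range suggested above; this sketch ignores the text box's own internal insets.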

Auto-fit

PowerPoint exposes three overflow behaviors:

Do not fit (<a:noAutofit/>). Text is clipped.
Shrink text on overflow (<a:normAutofit/>). Font shrinks until the text fits.
Resize shape to fit text (<a:spAutoFit/>). The box itself grows.

Placeholders (title and content slots bound to layouts) get “shrink text on overflow” to match the standard templates. Free-floating text boxes get “do not fit,” which respects the fixed size the converter just calculated. (A text box drawn by hand in PowerPoint defaults to resize-to-fit, so this is a deliberate departure.) Resize-to-fit turns predictable layouts into unpredictable ones, so it is rarely useful at conversion time.
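The policy above amounts to choosing one child element of <a:bodyPr>. A minimal sketch with Python's standard xml.etree.ElementTree follows; the body_pr function name and its is_placeholder flag are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
ET.register_namespace("a", A_NS)

def body_pr(is_placeholder):
    """Build <a:bodyPr> with the autofit child the policy above calls for."""
    body = ET.Element(f"{{{A_NS}}}bodyPr")
    # Placeholders shrink on overflow; free text boxes keep their fixed size.
    autofit = "normAutofit" if is_placeholder else "noAutofit"
    ET.SubElement(body, f"{{{A_NS}}}{autofit}")
    return body

print(ET.tostring(body_pr(False), encoding="unicode"))
```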

Paragraphs and runs inside the box

Paragraphs (<a:p>) work like Word paragraphs:

<a:pPr> carries paragraph properties: indent, alignment, line spacing.
<a:r> runs carry text with <a:rPr> properties: font, size, color, bold, italic.

Glyphs sharing font, size, color, and baseline get bound into one run.
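Run merging can be sketched with itertools.groupby, which groups consecutive items that share a key. The Glyph type and merge_into_runs function are hypothetical, and the sketch compares baselines for exact equality where a real converter would likely allow a small tolerance.

```python
from dataclasses import dataclass
from itertools import groupby

@dataclass(frozen=True)
class Glyph:
    char: str
    font: str
    size: float
    color: str
    baseline: float

def merge_into_runs(glyphs):
    """Collapse consecutive glyphs with identical styling into (text, style) runs."""
    def style(g):
        return (g.font, g.size, g.color, g.baseline)
    return [("".join(g.char for g in group), key)
            for key, group in groupby(glyphs, key=style)]
```

Each returned pair maps onto one <a:r>: the text goes in the run, the style tuple into its <a:rPr>.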

Layout binding changes everything

A text box bound to a layout placeholder inherits the layout’s font and size. The converter shouldn’t set them explicitly, because explicit values override inheritance and break theme switching. With binding, theme switching works (a new theme’s font flows through automatically) and outline view shows the placeholder text.

A free text box has no inheritance to fall back on. The converter has to set font, size, and color explicitly, otherwise PowerPoint applies its defaults, which are usually wrong.

Multi-column text

A page with multi-column text (common in print magazines) gives the converter a choice: one text box with multiple columns, or two independent text boxes.

The first uses PowerPoint’s built-in column support: <a:bodyPr numCols="2" spcCol="360000"><a:noAutofit/></a:bodyPr> (spcCol sets the inter-column gap in EMU). Editing reflows text across columns naturally.

The second matches positions exactly (every word where the PDF put it) at the cost of editability: changes in one column don’t affect the other.

The first option requires correct cross-column reading order. When that works, the multi-column box is better.
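To illustrate why reading order matters, here is a deliberately simple Python sketch for a two-column page. It assumes the columns split at the horizontal midpoint of the page and that each block is an (x_left, y_top, text) tuple with y growing downward; both the tuple layout and the reading_order name are assumptions made for the example.

```python
def reading_order(blocks, page_width):
    """Return block texts column by column instead of strictly top to bottom."""
    mid = page_width / 2
    left = sorted((b for b in blocks if b[0] < mid), key=lambda b: b[1])
    right = sorted((b for b in blocks if b[0] >= mid), key=lambda b: b[1])
    # Left column first, then right; a plain sort by y would interleave them.
    return [text for _, _, text in left + right]
```

Pages with headlines spanning both columns, or with unequal column widths, need a real column-detection step before this ordering is safe.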

Lists

Lists live inside a text box as ordinary paragraphs that carry bullet properties, much as in Word:

<a:p>
  <a:pPr><a:buChar char="•"/></a:pPr>
  <a:r><a:t>...item text...</a:t></a:r>
</a:p>

List detection looks for paragraphs that begin with a bullet glyph (•, ◦, –, *) or a numbering pattern (1., a), i.) at consistent indentation. PowerPoint supports multi-level lists and custom bullet markers.
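A minimal sketch of that detection in Python, using the standard re module. The patterns cover only the markers named above, and the classify and is_list helpers are hypothetical names; the numbering pattern will also match prose that happens to start with an initial (such as "A. Smith"), which is why the indentation check matters.

```python
import re

BULLET = re.compile(r"^\s*([•◦–*])\s+(.*)$")
NUMBER = re.compile(r"^\s*((?:\d+|[a-z]|[ivxlc]+)[.)])\s+(.*)$", re.IGNORECASE)

def classify(text):
    """Return (kind, marker, body) where kind is 'bullet', 'number' or 'plain'."""
    m = BULLET.match(text)
    if m:
        return ("bullet", m.group(1), m.group(2))
    m = NUMBER.match(text)
    if m:
        return ("number", m.group(1), m.group(2))
    return ("plain", None, text)

def is_list(paragraphs, tolerance=1.0):
    """paragraphs: [(indent_pt, text)]. True when every paragraph starts with
    a marker and all indents agree within the tolerance."""
    if len(paragraphs) < 2:
        return False
    indents = [indent for indent, _ in paragraphs]
    all_marked = all(classify(text)[0] != "plain" for _, text in paragraphs)
    return all_marked and max(indents) - min(indents) <= tolerance
```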

The cases that don’t translate cleanly

Text inside a shape

A PDF can place text on top of or inside another shape (text inside a circle, a label inside an arrow). PowerPoint supports text inside shapes (WordArt and ordinary shapes that accept text). The conversion requires detecting the spatial relationship and binding the text to the shape rather than emitting it as an adjacent text box.

Most converters skip this step and emit two independent shapes: one for the form, one text box for the label.
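Detecting the spatial relationship can start with a bounding-box containment test. The Python sketch below is an approximation under stated assumptions: boxes are (left, top, right, bottom) tuples, the contains and bind_labels helpers are hypothetical, and only bounding rectangles are compared, so a label near the corner of a circle's bounding box would still count as inside.

```python
def contains(outer, inner, slack=1.0):
    """True when inner lies within outer, allowing a small slack in points."""
    ol, ot, o_right, ob = outer
    il, it, i_right, ib = inner
    return (ol - slack <= il and ot - slack <= it
            and i_right <= o_right + slack and ib <= ob + slack)

def area(box):
    left, top, right, bottom = box
    return (right - left) * (bottom - top)

def bind_labels(shapes, texts):
    """Map each text index to the index of the smallest shape containing it."""
    bound = {}
    for t_index, text_box in enumerate(texts):
        holders = [s for s, shape in enumerate(shapes) if contains(shape, text_box)]
        if holders:
            bound[t_index] = min(holders, key=lambda s: area(shapes[s]))
    return bound
```

Picking the smallest container keeps a label attached to its arrow rather than to a large panel behind it; full-slide background rectangles would need to be excluded before calling this.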

Wrapping around images

PDF can wrap text around an image, with line breaks varying as the lines flow around the obstruction. PowerPoint does not support this inside a text box.

The two workarounds:

Approximation. Split the wrapping text into several boxes positioned around the image. Visually similar. Falls apart at the first edit.
Ignore. Emit one rectangular text box containing only part of the original text. Some text gets dropped.

Most converters take the second route.

Vertical text

PDF can rotate text 90° so that it runs vertically. PowerPoint supports vertical text too (Rotate all text 90°, Rotate all text 270°, Stacked), with limits. When the converter detects rotation, it sets the corresponding property on the text box: the vert attribute of <a:bodyPr>.
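One way to detect the rotation is to read the angle off the PDF text matrix [a b c d e f], where (a, b) gives the direction in which the text advances. The sketch below maps that angle to DrawingML vert values; the vert_attribute function name and the 5° tolerance are assumptions, and angles that match none of the three cases return None on the assumption that they need shape rotation instead.

```python
import math

def vert_attribute(a, b, tolerance_deg=5.0):
    """Map the (a, b) components of a PDF text matrix to a `vert` value."""
    # Counterclockwise angle; the PDF y axis points up the page.
    angle = math.degrees(math.atan2(b, a)) % 360
    if angle <= tolerance_deg or angle >= 360 - tolerance_deg:
        return "horz"
    if abs(angle - 90) <= tolerance_deg:
        return "vert270"  # text reads bottom to top
    if abs(angle - 270) <= tolerance_deg:
        return "vert"     # text reads top to bottom
    return None           # arbitrary angle: not a vertical-text case
```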

Headers and footers

If every slide carries the same corner text (slide number, date, author), that block belongs in the slide master. A converter that detects the repetition (the same way Word converters detect headers and footers) can promote it to the master and let the user edit it once.

In practice, most converters duplicate the block onto every slide. On a 30-slide deck, changing it means making the same edit 30 times.
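The repetition check itself is not complicated. A minimal Python sketch, assuming each page is a list of (x, y, text) tuples in points: positions are snapped to a coarse grid so that sub-point jitter between pages does not defeat the match. The repeated_blocks name, the grid size, and the 90% share threshold are all illustrative choices.

```python
from collections import Counter

def repeated_blocks(pages, min_share=0.9, grid=2.0):
    """Return (x_cell, y_cell, text) keys present on at least min_share of pages."""
    counts = Counter()
    for page in pages:
        # A set, so a block repeated within one page is counted once.
        keys = {(round(x / grid), round(y / grid), text) for x, y, text in page}
        counts.update(keys)
    needed = min_share * len(pages)
    return {key for key, n in counts.items() if n >= needed}
```

Blocks returned here are candidates for the slide master. Text that changes per page, such as a slide number, does not match and needs its own pattern check.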

Slide auto-numbering

PowerPoint slide numbers are dynamic fields:

<a:fld id="..." type="slidenum"><a:t>5</a:t></a:fld>

The id is a GUID; the text content (5) is a fallback that PowerPoint replaces at render time with the live slide number.

A PDF that says “Slide 5 of 30” carries both numbers as plain text. To restore the dynamic behavior, the converter has to detect the pattern across pages (page 5 has “5” in the corner, page 6 has “6”) and replace each instance with a slidenum field. Almost no converter does. Slide numbers stay static, and reordering slides in PowerPoint after conversion does not update them.
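The cross-page pattern can be sketched as follows: look for a position where, on every page, the text is exactly that page's number (optionally wrapped in "Slide N of M"). The page representation as lists of (x, y, text) tuples, the grid snapping, and the slide_number_positions name are assumptions made for this example.

```python
import re

def slide_number_positions(pages, grid=2.0):
    """Return grid positions whose text equals the page number on every page."""
    candidates = None
    for number, page in enumerate(pages, start=1):
        pattern = rf"(?:Slide\s+)?{number}(?:\s+of\s+\d+)?"
        here = {
            (round(x / grid), round(y / grid))
            for x, y, text in page
            if re.fullmatch(pattern, text.strip())
        }
        # Keep only positions that matched on all pages seen so far.
        candidates = here if candidates is None else candidates & here
    return candidates or set()
```

Any block at a returned position is a candidate for replacement with a slidenum field. Decks whose numbering skips the title slide or starts at an offset would need a looser match.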

What survives, and what doesn’t

Text content and basic formatting transfer with reasonable accuracy. The losses cluster in the structural layer:

Layout binding is the big one. Without it, theme switching does nothing.
Text wrapping around images: gone.
Wrapping around complex shapes: gone.
Dynamic slide numbers: frozen.
Cross-column reading order: sometimes garbled.

If the deliverable needs to be edited, the converter’s ability to bind text to layout placeholders matters more than any other feature. If it just needs to look right, most tools are adequate.