Fixing Face Consistency in GPT-4o Image Gen

• Rishi Desai

Demo Available: Try face enhancement yourself here.

The recent release of GPT-4o’s native image generation has stunned the AI community. Give it a reference image of a person and ask for a new scene. It nails the pose, outfit, hair, background, and lighting. Everything… except the face. It’s almost never quite right — often a little off and sometimes completely unrecognizable.

This is problematic because face consistency is the most important part of realistic character image generation.

[Before/after comparison: original GPT-4o generated image vs. face-enhanced result]

Over the last few weeks, I’ve experimented with ways to fix this. I’ve created a simple post-processing step that dramatically improves facial quality, while keeping everything else intact. If you’re building characters for storytelling or just want your images to actually look like your character, this method is a game changer.

Results

I present several examples to demonstrate my face enhancement process. You’ll see:

  1. The original reference image that was given to GPT-4o
  2. Two sets of before-and-after comparisons showing the enhancement results

Each image pair includes a Face Embed Distance (FED) score. FED measures how closely the generated face matches the reference. Scores range from 0.0 to 1.0, where lower values indicate better matches.

Daenerys Targaryen

[Reference face image]

[Pair 1: original GPT-4o image (FED: 0.97) vs. after face enhancement (FED: 0.56)]
[Pair 2: original GPT-4o image (FED: 0.89) vs. after face enhancement (FED: 0.50)]

Timothée Chalamet

[Reference face image]

[Pair 1: original GPT-4o image (FED: 1.0) vs. after face enhancement (FED: 0.36)]
[Pair 2: original GPT-4o image (FED: 0.77) vs. after face enhancement (FED: 0.44)]

A Character Generated by Flux.1-dev

[Reference: full-body image and upscaled face]

[Pair 1: original GPT-4o image (FED: 1.0) vs. after face enhancement (FED: 0.75)]
[Pair 2: original GPT-4o image (FED: 1.0) vs. after face enhancement (FED: 0.67)]

Face Enhancement

Malformed faces are a persistent challenge in image generation. Traditional face improvement methods are not effective:

  • SDXL inpainting fails because it can’t guarantee facial consistency when working with masked face regions.
  • ControlNet (any mode) struggles to maintain facial consistency.
  • ComfyUI custom nodes like FaceDetailer require a reasonably well-formed face as a starting point.

Swapping the reference face onto the target image can be a great option, especially for front-facing images. However, it can struggle with facial expressions, varied perspectives, and lighting. Unlike face swapping, which replaces facial pixels directly, face enhancement regenerates the image with guided facial features via latent diffusion. This provides far more granular control over facial consistency.

Preprocessing

We need a high-quality, forward-facing image of our character. AutoCropFace extracts and crops the face, but the resulting crop is blurry and needs upscaling. ESRGAN is a simple approach that offers a good balance of speed and fidelity: it won't introduce many new details like skin textures or hairstyles. fal.ai has a convenient endpoint for face upscaling.

[Cropped face (original, blurry) vs. ESRGAN-upscaled face]
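AutoCropFace is a ComfyUI node, but the crop step is easy to approximate outside ComfyUI. Here's a minimal sketch using InsightFace detection directly; the buffalo_l model pack and 40% margin are my choices, not the node's defaults, and upscaling the crop is a separate ESRGAN pass.

```python
import cv2
from insightface.app import FaceAnalysis

# Face detector/recognizer; "buffalo_l" is InsightFace's standard model pack.
app = FaceAnalysis(name="buffalo_l")
app.prepare(ctx_id=0, det_size=(640, 640))

def crop_face(in_path: str, out_path: str, margin: float = 0.4) -> None:
    """Crop the largest detected face, padded by `margin` on each side."""
    img = cv2.imread(in_path)
    faces = app.get(img)
    if not faces:
        raise ValueError(f"no face detected in {in_path}")
    # Keep the largest face if several are detected.
    face = max(faces, key=lambda f: (f.bbox[2] - f.bbox[0]) * (f.bbox[3] - f.bbox[1]))
    x1, y1, x2, y2 = face.bbox
    pad_x, pad_y = (x2 - x1) * margin, (y2 - y1) * margin
    h, w = img.shape[:2]
    x1, y1 = max(int(x1 - pad_x), 0), max(int(y1 - pad_y), 0)
    x2, y2 = min(int(x2 + pad_x), w), min(int(y2 + pad_y), h)
    cv2.imwrite(out_path, img[y1:y2, x1:x2])
```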

We also need a detailed caption of the target image. GPT-4o is the best model for captioning, but it often refuses to caption images of people it recognizes. Florence2 is the best open-source image captioning model, and it never refuses to answer.
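Following the usage from Florence2's model card, a minimal captioning sketch looks like this; `<MORE_DETAILED_CAPTION>` is the task token for long captions, and `num_beams=3` with `max_new_tokens=1024` mirrors the card's example settings.

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Florence2 ships custom modeling code, hence trust_remote_code=True.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Florence-2-large", torch_dtype=dtype, trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(
    "microsoft/Florence-2-large", trust_remote_code=True
)

def caption_image(path: str) -> str:
    image = Image.open(path).convert("RGB")
    task = "<MORE_DETAILED_CAPTION>"  # Florence2's long-caption task token
    inputs = processor(text=task, images=image, return_tensors="pt").to(device, dtype)
    generated = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    raw = processor.batch_decode(generated, skip_special_tokens=False)[0]
    parsed = processor.post_process_generation(raw, task=task, image_size=image.size)
    return parsed[task]
```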

Face-Guided Image Generation Models

We turn to image-to-image models that take a face image as input. The three leading Flux-based models are PuLID-Flux, IP-Adapter, and ACE++. (We'll exclude InfiniteYou because it's only just starting to be incorporated into the ComfyUI ecosystem.) All three share a common foundation: they extract a face ID embedding using InsightFace and use it to guide the diffusion process toward facial consistency. PuLID-Flux is the best of the three; it uses contrastive learning to preserve face identity without compromising the model's ability to follow the text prompt.

Note that InsightFace is optimized for the faces of real people. It's often unable to detect faces in anime characters, cartoons, and non-human subjects. Hence, we'll focus only on photorealistic images of people.

While PuLID-Flux excels at generating new images from text prompts and a reference face, our task is different. We already have generated images, and we need to enhance their facial quality using our high-quality face reference.

[Face enhancement pipeline diagram]

Implementation

The ComfyUI community provides a non-obvious solution through the ComfyUI-PuLID-Flux custom node. While it's primarily intended as a PuLID-Flux wrapper, its implementation lets us do much more. This node patches the forward pass of Flux to inject the face ID embeddings into the diffusion process. More specifically, it uses InsightFace and EVA-CLIP to detect the face and extract the face ID embedding, which is then injected into the diffusion process at specific steps.

We can use the patched Flux model in arbitrary ways within ComfyUI because of the flexibility in model inputs and outputs (in contrast to HuggingFace’s diffusers). Hence, we can leverage it with tiled ControlNet to improve the face quality in any image. We set the positive prompt to our image caption to guide the ControlNet to regenerate the entire image, while enforcing facial consistency based on the reference face embedding. We keep the ControlNet strength high (0.8-1.0), so generation follows the control image strictly. The ID weight in PuLID controls the level of face ID preservation during generation. I found the sweet spot for ID weight between 0.6 and 0.8; values higher than 0.8 tend to smooth out facial expressions in the target image.
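To summarize the knobs in one place, here is an illustrative settings block; the key names are descriptive placeholders of mine, not actual ComfyUI node fields:

```python
# Descriptive placeholders for the settings above, not real ComfyUI node fields.
ENHANCEMENT_SETTINGS = {
    "positive_prompt": "<Florence2 caption of the target image>",  # drives full-image regeneration
    "controlnet_strength": 0.9,  # keep high (0.8-1.0) so output tracks the control image
    "pulid_id_weight": 0.7,      # sweet spot 0.6-0.8; above 0.8 smooths out facial expressions
}
```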

Although PuLID is a popular model in the ComfyUI community, I haven’t seen it used in this particular way before. Mickmumpitz’s Flux character consistency workflow is the first I’ve come across that utilizes ComfyUI-PuLID-Flux specifically for face enhancement. It served as a helpful reference point for building my own workflows.

A Metric for Face Quality

Let's quantify face quality. The Face Analysis custom node is a simple yet robust option. It uses InsightFace to extract face ID embeddings and computes the Face Embed Distance (FED): the cosine distance between the reference image's and the generated image's face embeddings, scaled to lie between 0 and 1.0, where lower values are better. Typically you would calculate the scores between three or four images of the same person to set a baseline value. Since we only have one reference image, we can only compare the relative scores between generated images, not their absolute values.
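As a sanity check outside ComfyUI, FED can be approximated directly with InsightFace embeddings. This sketch assumes the metric is 1 minus cosine similarity, clipped into [0, 1]; the Face Analysis node's exact scaling may differ.

```python
import cv2
import numpy as np
from insightface.app import FaceAnalysis

app = FaceAnalysis(name="buffalo_l")
app.prepare(ctx_id=0, det_size=(640, 640))

def face_embedding(path: str) -> np.ndarray:
    faces = app.get(cv2.imread(path))
    if not faces:
        raise ValueError(f"no face detected in {path}")
    # Use the largest face; normed_embedding is already L2-normalized.
    face = max(faces, key=lambda f: (f.bbox[2] - f.bbox[0]) * (f.bbox[3] - f.bbox[1]))
    return face.normed_embedding

def face_embed_distance(ref_path: str, gen_path: str) -> float:
    cos_sim = float(np.dot(face_embedding(ref_path), face_embedding(gen_path)))
    # Cosine distance clipped into [0, 1]; lower means a closer facial match.
    return float(np.clip(1.0 - cos_sim, 0.0, 1.0))
```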

Face Embed Distance has limitations, as viewing angles and facial expressions lead to different face embeddings. Nevertheless, its simplicity and ease of use in ComfyUI make it a great proxy for quality.

Details on Productionizing

Productionizing complex ComfyUI workflows with dozens of custom nodes is non-trivial. I use the ComfyUI-to-Python-Extension to directly translate the workflow into a runnable Python script with the same custom node dependencies. Translating workflows into scripts is difficult, so I plan on discussing this topic in greater detail in a future post.

I attempted to replicate the entire ComfyUI workflow in Python using diffusers but couldn’t match ComfyUI’s speed or quality. ComfyUI’s strength lies in its custom node ecosystem, enabling complex model modifications and sampling procedures that are challenging to implement in mainstream Python frameworks. My efforts to use ComfyUI custom nodes as libraries with extensive monkey patching were also unsuccessful due to their deep integration with ComfyUI’s backend.

My base model is Flux.1-dev, the ControlNet is Shakker-Labs ControlNet Union, and the PuLID model is PuLID-Flux-v0.9.1. I run Flux at float16 precision on a single L40S GPU with 48 GB of VRAM. Face enhancement takes around 30 seconds once the models are loaded into memory.

Parting Thoughts

GPT-4o's native image generation is incredibly powerful, but its struggle with facial consistency limits its use for applications like storytelling. By leveraging PuLID-Flux and ControlNet, we can easily enhance the face quality in photorealistic images without sacrificing the background, lighting, or other details. This is a helpful step in making GPT-4o a reliable tool for creating visually coherent, character-consistent images.