Production-Ready MCP Clients for LLMs
Build robust, scalable MCP integrations with structured validation, error handling, and debugging capabilities
In the previous blog post, we explored the basics of MCP and built a simple integration connecting your LLM to an MCP server using Pydantic AI. That approach works perfectly for demos and experimentation, but production systems need more robust patterns. If you don't know what's happening underneath, you'll lose visibility and control at the worst possible time: when things break.
This post focuses on building an MCP client that can handle real-world complexity: multiple concurrent tool calls, validation failures, error recovery, and the debugging visibility you need when things go wrong.
The Production Reality Check
Here's what happens when you move beyond demos:
Tool Selection Failures: Your LLM confidently chooses write_query when it should use read_query, potentially corrupting data.
Parameter Validation Issues: The LLM generates syntactically correct but semantically wrong SQL queries that crash your database.
Concurrent Execution Problems: Multiple tool calls running simultaneously cause race conditions and resource conflicts.
Debugging Nightmares: When something goes wrong, you have no visibility into why the LLM made specific tool choices.
Scale Bottlenecks: Your simple sequential execution pattern can't handle the volume of requests in production.
Let's solve these problems systematically.
Case Study
An e-commerce company deployed an AI assistant to help its customer service team query their customer service database. The demo was flawless: the AI could answer questions like "How many customers signed up last month?" and "Show me John Smith's order history" with perfect accuracy.
Three weeks after launch, their system had:
- Created 2,847 duplicate customer records (AI chose INSERT instead of SELECT)
- Corrupted 156 order statuses (invalid SQL syntax passed validation)
Root cause? Their LLM integration had no validation, no error handling, and no visibility into what the AI was actually doing.
Architecture Overview: Structured Tool Call Management

Instead of letting the LLM directly execute tools through a simple interface, we'll build a structured pipeline:
- Tool Discovery & Model Generation: Dynamically create validation models for each tool
- Intelligent Tool Selection: Let the LLM choose tools with proper validation
- Parameter Validation: Ensure all tool calls have valid parameters before execution
- Robust Execution: Handle errors gracefully and provide detailed logging
- Result Synthesis: Consolidate results into coherent responses
This approach gives us type safety, error recovery, and the debugging visibility needed for production systems.
Step 1: Structured Tool Definitions for LLM Use
The MCP protocol defines a structured message format based on JSON-RPC 2.0.
To ensure that the LLM can generate valid inputs for tools, we need to give it a precise schema. That's where Pydantic models come in: they enforce structure, validate inputs, and catch errors before execution.
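For context, here is roughly what a tool invocation looks like on the wire as a JSON-RPC request (the tool name and arguments below are illustrative):

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "read_query",
    "arguments": { "query": "SELECT COUNT(*) FROM customers" }
  }
}
```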
For the write_query tool, the corresponding Pydantic model would be:

```python
from pydantic import BaseModel, Field

class write_query(BaseModel):
    query: str = Field(..., description="SQL query to execute")
```
This guides the LLM to return JSON like
{ "query": "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)" }
Once the LLM selects the relevant tools, we convert them into ToolDefinition objects. These definitions are later used to generate Pydantic models for structured parameter generation by the LLM.
Let's define models that can handle multiple tool calls with proper validation:

```python
from typing import Dict, List

from pydantic import BaseModel, ValidationInfo, field_validator


class ToolCall(BaseModel):
    """Model for a single tool call with its parameters"""
    name: str
    description: str
    parameters: dict


class ToolCalls(BaseModel):
    """Model for multiple tool calls with comprehensive validation"""
    calls: List[ToolCall]

    @field_validator("calls")
    @classmethod
    def validate_tool_calls(cls, v, info: ValidationInfo):
        # Get available tools from context (handle missing context gracefully)
        tools: List[ToolDefinition] = (info.context or {}).get("tools", [])
        valid_tool_names = [tool.name for tool in tools]

        # Check for invalid tool names
        invalid_names = [call.name for call in v if call.name not in valid_tool_names]
        if invalid_names:
            raise ValueError(
                f"Tools {invalid_names} are not valid. Valid tools are: {valid_tool_names}"
            )

        # Prevent overwhelming the system with too many concurrent calls
        if len(v) > 4:
            raise ValueError("You can only select at most 4 tools to call")

        # Check for duplicate tool calls (usually indicates poor planning)
        tool_names = [call.name for call in v]
        if len(tool_names) != len(set(tool_names)):
            raise ValueError("Duplicate tool calls detected - this usually indicates inefficient planning")

        return v


class LLMResponse(BaseModel):
    """Structured response from the final LLM synthesis"""
    answer: str
    confidence: float = 1.0
    tools_used: List[str] = []
    warnings: List[str] = []
```
This validation layer catches common problems early:
- Invalid tool names are rejected with helpful error messages
- Resource limits prevent the system from being overwhelmed
- Duplicate calls are flagged as potential inefficiencies
- Missing context is handled gracefully
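Because the validator pulls the tool list from Pydantic's validation context, the caller has to pass that context in explicitly. A minimal usage sketch (the raw payload below is hypothetical; `tools` is the list of ToolDefinition objects discovered from the MCP server):

```python
from pydantic import ValidationError

raw_response = {
    "calls": [
        {
            "name": "read_query",
            "description": "Run a SELECT statement",
            "parameters": {"query": "SELECT COUNT(*) FROM customers"},
        }
    ]
}

try:
    # context= is how the validator's info.context gets populated
    tool_calls = ToolCalls.model_validate(raw_response, context={"tools": tools})
except ValidationError as exc:
    # An unknown tool name, duplicates, or more than 4 calls ends up here
    print(f"LLM produced invalid tool calls: {exc}")
```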
Step 2: Dynamic Tool Model Creation
Rather than manually creating Pydantic models for each tool, we can dynamically generate them from the MCP server's tool schemas. Each generated model acts as a response template for the LLM, guiding it to produce structured parameters tailored to the tool.
Once the parameters are generated, they're used to invoke the tool via the MCP client, completing the reasoning-to-execution loop:
```python
def create_tool_models(tools: List[ToolDefinition]) -> Dict[str, BaseModel]:
    """
    Creates Pydantic models for each tool based on their schemas.

    Args:
        tools: List of tool definitions from the MCP client

    Returns:
        Dictionary mapping tool names to their Pydantic models
    """
    tool_models = {}
    for tool in tools:
        tool_def = ToolDefinition(
            name=tool.name,
            description=tool.description,
            parameters_json_schema=tool.inputSchema,
        )
        tool_models[tool.name] = create_model_from_tool_schema(tool_def)
    return tool_models


# Example usage:
async def initialize_database_agent():
    """Initialize an agent that can manage customer database operations"""
    config = {
        "mcpServers": {
            "customer_db": {
                "command": "uvx",
                "args": ["mcp-server-sqlite", "--db-path", "customers.sqlite"],
            }
        }
    }

    client = Client(config)
    async with client:
        tools = await client.list_tools()
        tool_models = create_tool_models(tools)

        print(f"Successfully created models for {len(tool_models)} tools:")
        for name in tool_models.keys():
            print(f" - {name}")

        return client, tools, tool_models
```
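The helper create_model_from_tool_schema is referenced above but not shown. Here's a minimal sketch, assuming flat JSON schemas with primitive property types; a real implementation would handle nested objects, enums, and defaults, and presumably also carries the tool's name/description fields (which is why Step 4 excludes them when dumping parameters):

```python
from typing import Any, Dict, Optional, Type

from pydantic import BaseModel, Field, create_model

# Rough mapping from JSON schema types to Python types
_JSON_TO_PY: Dict[str, type] = {
    "string": str,
    "integer": int,
    "number": float,
    "boolean": bool,
    "array": list,
    "object": dict,
}


def create_model_from_tool_schema(tool: "ToolDefinition") -> Type[BaseModel]:
    """Sketch: build a Pydantic model from a tool's JSON schema."""
    schema = tool.parameters_json_schema or {}
    properties = schema.get("properties", {})
    required = set(schema.get("required", []))

    fields: Dict[str, Any] = {}
    for prop_name, prop_schema in properties.items():
        py_type = _JSON_TO_PY.get(prop_schema.get("type", "string"), str)
        description = prop_schema.get("description", "")
        if prop_name in required:
            fields[prop_name] = (py_type, Field(..., description=description))
        else:
            fields[prop_name] = (Optional[py_type], Field(None, description=description))

    return create_model(tool.name, **fields)
```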
This approach has several advantages:
- Automatic synchronization with MCP server capabilities as they evolve over time
- Type safety for all tool parameters
- Graceful degradation when individual tools have schema issues
- Easy debugging of tool model creation problems
Step 3: Intelligent Tool Selection with Context
Instead of letting the LLM choose tools blindly, we'll give it rich context about available tools and guide it toward making good decisions:
```python
async def generate_tool_calls(
    user_query: str,
    tools: List[ToolDefinition],
    async_client,
    context: Dict = None,
) -> ToolCalls:
    """
    Uses the LLM to generate appropriate tool calls based on the user query.

    Returns:
        ToolCalls object containing the LLM's chosen tool calls
    """
    system_prompt = f"""You are a helpful assistant that can call tools in response to user requests.

Available tools:
{[f"- {tool.name}: {tool.description}" for tool in tools]}

Guidelines for tool selection:
1. **Read before write**: Always use read_query before write_query to understand data structure
2. **Validate existence**: Use list_tables or describe_table to check if tables/columns exist
3. **Be conservative**: Don't make unnecessary function calls
4. **Think sequentially**: Some operations must happen in order
5. **Maximum 4 tools**: You can select at most 4 tools per request

For each tool call, provide the appropriate parameters based on the tool's schema.
Think step by step about what information you need and in what order."""

    return await async_client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_query},
        ],
        temperature=0.0,  # Use deterministic output for tool selection
        response_model=ToolCalls,
        context={"tools": tools},
    )
```
The improved prompting includes:
- Tool names and descriptions so the LLM understands exactly what each tool does
- Best practice guidelines based on common failure patterns
- Sequential thinking encouragement for multi-step operations
- Conservative defaults to prevent unnecessary tool execution
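Note that async_client isn't defined in this snippet; it behaves like an OpenAI client patched by the instructor library, since response_model and the validation context keyword aren't raw OpenAI API parameters. A minimal setup sketch, assuming instructor:

```python
import instructor
from openai import AsyncOpenAI

# Patch the OpenAI async client so it returns validated Pydantic models.
# Depending on your instructor version, the keyword used in generate_tool_calls
# may need to be validation_context= instead of context=.
async_client = instructor.from_openai(AsyncOpenAI())

# tool_calls = await generate_tool_calls("How many customers signed up last month?", tools, async_client)
```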
Step 4: Robust Tool Execution with Error Handling
Now we'll execute the tool calls with comprehensive error handling and logging:
```python
import asyncio
import logging
import time

logger = logging.getLogger(__name__)


async def execute_tool_calls(
    tool_calls: ToolCalls,
    tool_models: Dict[str, BaseModel],
    client,
    timeout: int = 30,
) -> List[dict]:
    """
    Executes tool calls with robust error handling and detailed logging.

    Args:
        tool_calls: The generated tool calls to execute
        tool_models: Dictionary of tool models for validation
        client: The MCP client
        timeout: Timeout for individual tool calls

    Returns:
        List of dictionaries containing tool call results and metadata
    """
    results = []

    for i, tool_call in enumerate(tool_calls.calls):
        start_time = time.time()

        try:
            # Validate tool exists
            if tool_call.name not in tool_models:
                error_msg = f"Tool {tool_call.name} not found in available tools"
                logger.error(error_msg)
                results.append({
                    "tool_name": tool_call.name,
                    "parameters": tool_call.parameters,
                    "response": None,
                    "error": error_msg,
                    "execution_time": 0,
                    "success": False,
                })
                continue

            # Validate parameters using the tool's Pydantic model
            tool_model = tool_models[tool_call.name]
            try:
                validated_params = tool_model(**tool_call.parameters)
                clean_params = validated_params.model_dump(exclude={"name", "description"})
            except Exception as validation_error:
                error_msg = f"Parameter validation failed for {tool_call.name}: {validation_error}"
                logger.error(error_msg)
                results.append({
                    "tool_name": tool_call.name,
                    "parameters": tool_call.parameters,
                    "response": None,
                    "error": error_msg,
                    "execution_time": 0,
                    "success": False,
                })
                continue

            # Execute the tool call with timeout
            logger.info(f"Executing tool {tool_call.name} with params: {clean_params}")
            try:
                response = await asyncio.wait_for(
                    client.call_tool(tool_call.name, clean_params),
                    timeout=timeout,
                )
                execution_time = time.time() - start_time
                logger.info(f"Tool {tool_call.name} completed in {execution_time:.2f}s")
                results.append({
                    "tool_name": tool_call.name,
                    "parameters": clean_params,
                    "response": response[0] if response else "No response",
                    "error": None,
                    "execution_time": execution_time,
                    "success": True,
                })
            except asyncio.TimeoutError:
                error_msg = f"Tool {tool_call.name} timed out after {timeout}s"
                logger.error(error_msg)
                results.append({
                    "tool_name": tool_call.name,
                    "parameters": clean_params,
                    "response": None,
                    "error": error_msg,
                    "execution_time": timeout,
                    "success": False,
                })
            except Exception as execution_error:
                execution_time = time.time() - start_time
                error_msg = f"Tool execution failed: {execution_error}"
                logger.error(f"Tool {tool_call.name} failed after {execution_time:.2f}s: {execution_error}")
                results.append({
                    "tool_name": tool_call.name,
                    "parameters": clean_params,
                    "response": None,
                    "error": error_msg,
                    "execution_time": execution_time,
                    "success": False,
                })

        except Exception as unexpected_error:
            # Catch-all for any unexpected errors
            execution_time = time.time() - start_time
            error_msg = f"Unexpected error: {unexpected_error}"
            logger.error(f"Unexpected error in tool {tool_call.name}: {unexpected_error}")
            results.append({
                "tool_name": tool_call.name,
                "parameters": tool_call.parameters,
                "response": None,
                "error": error_msg,
                "execution_time": execution_time,
                "success": False,
            })

    # Log execution summary
    successful_calls = sum(1 for r in results if r["success"])
    total_time = sum(r["execution_time"] for r in results)
    logger.info(
        f"Executed {len(results)} tool calls: {successful_calls} successful, total time: {total_time:.2f}s"
    )

    return results
```
This execution framework provides:
- Parameter validation before any tool execution
- Timeout protection to prevent hanging calls
- Detailed logging for debugging and monitoring
- Graceful error handling that doesn't crash the entire workflow
- Performance metrics for optimization
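The loop above runs calls sequentially, which keeps logging simple but can be slow when calls are independent. A hedged sketch of a concurrent variant, bounding parallelism with a semaphore; it assumes the per-call validation/timeout/logging logic above has been factored into a hypothetical execute_single_call helper, and it is only appropriate when the selected calls don't depend on each other (sequential execution stays safer for read-then-write plans):

```python
import asyncio


async def execute_tool_calls_concurrently(
    tool_calls: ToolCalls,
    tool_models: Dict[str, BaseModel],
    client,
    max_concurrency: int = 2,
    timeout: int = 30,
) -> List[dict]:
    """Run independent tool calls in parallel, capped by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(tool_call: ToolCall) -> dict:
        async with semaphore:
            # execute_single_call is a hypothetical refactor of the per-call
            # logic from execute_tool_calls; it returns the same result dict.
            return await execute_single_call(tool_call, tool_models, client, timeout)

    # No return_exceptions needed: execute_single_call converts failures
    # into result dicts instead of raising.
    return await asyncio.gather(*(run_one(call) for call in tool_calls.calls))
```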
Step 5: Intelligent Response Synthesis
Finally, we'll synthesize the results into a coherent response that acknowledges both successes and failures:
```python
async def generate_final_response(
    user_query: str,
    tool_responses: List[dict],
    async_client,
) -> LLMResponse:
    """
    Generates a comprehensive final response that handles both successful
    and failed tool executions intelligently.
    """
    # Separate successful and failed tool calls
    successful_results = [r for r in tool_responses if r["success"]]
    failed_results = [r for r in tool_responses if not r["success"]]

    # Build context for the LLM
    context_parts = []

    if successful_results:
        context_parts.append("Successful tool executions:")
        for result in successful_results:
            context_parts.append(f"- {result['tool_name']}: {result['response']}")

    if failed_results:
        context_parts.append("\nFailed tool executions:")
        for result in failed_results:
            context_parts.append(f"- {result['tool_name']}: {result['error']}")

    context = "\n".join(context_parts)

    system_prompt = f"""You are analyzing the results of tool executions to answer a user query.

Original query: {user_query}

Tool execution results:
{context}

Instructions:
1. If all tools succeeded, provide a complete answer based on the results
2. If some tools failed, acknowledge the limitations and provide partial answers where possible
3. If critical tools failed, explain what couldn't be determined and why
4. Suggest next steps if the query couldn't be fully answered
5. Be honest about limitations - don't make up information

Provide your confidence level (0.0-1.0) based on how completely you could answer the query."""

    return await async_client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "Please analyze these results and provide a comprehensive response."},
        ],
        temperature=0.1,
        response_model=LLMResponse,
    )
```
Step 6: Complete Production Workflow
Here's how all the pieces fit together in a production-ready system:
```python
import asyncio
import time

from fastmcp import Client


async def production_mcp_workflow(
    user_query: str,
    config: dict,
    max_retries: int = 2,
) -> LLMResponse:
    """
    Complete production-ready MCP workflow with error handling and retries.
    """
    for attempt in range(max_retries + 1):
        try:
            logger.info(f"Processing query (attempt {attempt + 1}): {user_query}")

            # Initialize MCP client and discover tools
            client = Client(config)
            async with client:
                tools = await client.list_tools()
                tool_models = create_tool_models(tools)

                if not tools:
                    return LLMResponse(
                        answer="No tools available from the MCP server.",
                        confidence=0.0,
                        warnings=["MCP server returned no tools"],
                    )

                # Generate tool calls
                tool_calls = await generate_tool_calls(user_query, tools, async_client)
                logger.info(
                    f"Generated {len(tool_calls.calls)} tool calls: {[call.name for call in tool_calls.calls]}"
                )

                # Execute tools
                tool_responses = await execute_tool_calls(tool_calls, tool_models, client)
                logger.info(f"Executed {len(tool_responses)} tool calls")

                # Generate final response
                final_response = await generate_final_response(
                    user_query=user_query,
                    tool_responses=tool_responses,
                    async_client=async_client,
                )
                return final_response

        except Exception as e:
            logger.error(f"Error in workflow (attempt {attempt + 1}): {str(e)}")

            if attempt == max_retries:
                return LLMResponse(
                    answer="I apologize, but I encountered an error while processing your request. Please try again later.",
                    confidence=0.0,
                    warnings=[f"Workflow failed after {max_retries + 1} attempts: {str(e)}"],
                )

            # Wait before retrying (linearly increasing backoff)
            await asyncio.sleep(1 * (attempt + 1))
```
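Putting it together, here's a hedged usage sketch for the query below. The config mirrors the SQLite example from Step 2, and async_client is the instructor-patched client from Step 3:

```python
async def main():
    config = {
        "mcpServers": {
            "customer_db": {
                "command": "uvx",
                "args": ["mcp-server-sqlite", "--db-path", "customers.sqlite"],
            }
        }
    }
    response = await production_mcp_workflow(
        user_query="see if the table animal exists. If it exists, give description of the table",
        config=config,
    )
    print(response.answer)


if __name__ == "__main__":
    asyncio.run(main())
```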
User Query : "see if the table animal exists. If it exists, give description of the table"
Response : The table "animals" does exist in the database. It has the following structure:
1. name: Type - TEXT, Nullable - Yes
2. type: Type - TEXT, Nullable - Yes
3. age: Type - INTEGER, Nullable - Yes
This table does not have any primary key defined
DAG-based Execution for Multi-Step Workflows
So far, we've focused on single-turn tool interactions, but many real-world queries require multi-step, dependent tool interactions between your LLM agent and an MCP server.
Suppose your LLM agent receives a user query:
"List all tables in the database and describe each one."
With a DAG-based approach:
- The agent calls the list_tables tool (root node).
- For each table returned, the agent creates a node to call the describe_table tool.
- The results are gathered and passed to the LLM for summarization or further reasoning.
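A minimal sketch of this fan-out pattern, assuming the same fastmcp Client and the SQLite server's list_tables/describe_table tools used earlier; a full DAG executor would also track dependencies explicitly and run independent branches concurrently:

```python
async def list_and_describe_tables(client) -> dict:
    """Root node: list_tables; one dependent node per table: describe_table."""
    tables_result = await client.call_tool("list_tables", {})
    # The exact result shape depends on the MCP server; parse_table_names is a
    # hypothetical helper that extracts table names from the returned content.
    table_names = parse_table_names(tables_result)

    descriptions = {}
    for table in table_names:
        # Assumes the SQLite server's describe_table tool accepts a table_name argument
        descriptions[table] = await client.call_tool("describe_table", {"table_name": table})
    return descriptions
```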
This pattern generalizes to any scenario where:
- Agents dynamically decide the number of tool calls based on data.
- Tool calls can be made only if certain conditions are met.
- The output of one tool call influences the flow of the next.
I'll explore this DAG-based approach more thoroughly in a later part of this series.
Production MCP Pitfalls
Pitfall #1: The "Demo Magic" Problem
Your MCP integration works perfectly in controlled demos but fails with real user queries.
Real example: A customer support bot worked flawlessly when demoed with "Show me customer John Smith's tickets." But when a real user asked "What's up with john smith's stuff?", the AI generated:
```sql
SELECT * FROM customers WHERE name = "john smith's stuff"
```
Pitfall #2: The Timeout Cascade
One slow query blocks your entire system.
Real Example: A user asked "Show me all customer data for analysis." The AI generated a query that took 45 seconds to complete, blocking all other requests.
Pitfall #3: The Context Window Explosion
Tool responses exceed the LLM's context window, causing failures or truncated responses.
Real Example: "Show me all customer tickets" returned 50,000 rows, consuming the entire context window and making the LLM unusable.
Pitfall #4: The Silent Failure Trap
Tools fail silently, and the AI hallucinates responses based on no data.
Real Example: Database query fails due to a locked table, but the AI responds: "John Smith has 3 open tickets" (completely made up).
I recently tweeted about one.
Pitfall #5: The Error Message Black Hole
Unclear error messages make debugging impossible when your tools don't return appropriate error messages for the LLM to parse.
Pitfall #6: The Parameter Validation Illusion
Parameters look correct but contain subtle errors that cause wrong results.
Real example: The AI generates SELECT * FROM customers WHERE created_at = '2024-13-45' (an invalid date that SQLite accepts but that returns no results).
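One mitigation in the spirit of this post's validation layer: when a parameter has known semantics, give the generated model a real type rather than a bare string, so impossible values never reach the database. A hedged, hypothetical illustration (SignupFilter is not part of the code above):

```python
from datetime import date

from pydantic import BaseModel, ValidationError


class SignupFilter(BaseModel):
    # A real date type lets Pydantic reject impossible dates before execution,
    # unlike a plain str field that would happily accept '2024-13-45'.
    created_at: date


try:
    SignupFilter(created_at="2024-13-45")
except ValidationError as exc:
    print(f"Caught before it ever hit the database: {exc}")
```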
What's Next: The Intelligence Problem
You now have a robust, production-ready MCP client that won't crash, corrupt data, or leave you debugging at 3 AM. But there's still one critical question we haven't answered:
How do you know if your AI is making the right tool choices? The decision to invoke the right tool with the right parameters still rests with the LLM, which is inherently non-deterministic. Things get tricky when you're dealing with 20 tools and the LLM has to decide which ones to pick and in what order to execute them.
In the next part of this series, we'll dive into the overlooked but critical problem: How do you know if your agent is choosing the right tool — and how do you fix it when it doesn't?
We'll explore tool retrieval evaluation, failure patterns, and practical ways to debug and improve tool selection, so your agent not only runs, but runs smart.
💡 Got questions about implementing these patterns? Drop them in comments
This post is part of a series on production AI engineering with MCP. Follow along as we build from basic connections to enterprise-grade AI systems that you can actually depend on.