Technical Manual Query System
Over the past few months, I've been grappling with a persistent operational challenge at work. We maintain an extensive library of operations manuals such as the Operations Manuals Parts A through D, type-specific Operations Manuals, Cabin Safety Manual, OCC Manual, Electronic Flight Bag procedures, FCOMs and AFMs for each aircraft type, plus the ever-evolving MCAR Air Operations and Aircrew regulations. The sheer volume of documentation creates friction in daily operations. When a line person asks about specific duty time limitations, or when we need to cross-reference fuel planning requirements between company procedures and manufacturer recommendations, or when auditors request evidence of regulatory compliance, the cognitive overhead of navigating hundreds of pages across multiple PDFs becomes a bottleneck.
The traditional approach involves opening multiple PDFs, using basic search functions that return every instance of a keyword regardless of context, manually scanning through sections to find relevant information, and then synthesizing answers by mentally cross-referencing different manuals. This is inefficient for routine queries and time-consuming during critical situations or audit preparations. I needed a solution that could understand the semantic content of our documentation and respond to natural language queries with properly cited sources.
I'd been following developments in large language models with professional interest. The recent improvements in context windows and document understanding capabilities suggested these models could handle the complexity of aviation technical documentation. The question was whether I could build a practical system without spending significant capital on infrastructure.
The architecture I settled on is what's known as Retrieval Augmented Generation. The fundamental concept is straightforward but quite powerful. It creates a searchable database of manual content, retrieves the most relevant sections for any given query, and then provides those sections as context to the model, which generates a comprehensive answer. This approach combines the precision of traditional database search with the natural language understanding capabilities of modern language models.
The technical implementation required several components working together. First, the document processing pipeline that converts PDFs into machine-readable chunks while preserving the hierarchical structure of sections and chapters. Second, a vector database that stores these chunks in a way that enables semantic search rather than just keyword matching. Third, the query interface that retrieves relevant chunks and constructs instructions for the model. Fourth, the deployment infrastructure that makes this accessible through a web interface.
I chose to build this on entirely free infrastructure to prove the concept before committing resources. GitHub provides free private repositories for code storage and version control. Streamlit Cloud offers free hosting for Python web applications. Google's API provided generous rate limits for the query volumes we'd actually need. The only costs are time and the cognitive load of learning these systems.
The development environment setup was deliberately minimal. Python 3.12 for the core runtime because it's stable and well-supported. Visual Studio Code as the development environment because it has excellent Git integration and Python tooling. Git for version control because it's the industry standard and integrates seamlessly with GitHub. These are professional-grade tools used by major software companies, but they're all freely available.
The project structure follows standard Python application conventions. A requirements.txt file specifies exact versions of all dependencies to ensure reproducible builds. The main application file app.py contains the Streamlit user interface code. A separate utils.py module encapsulates the document processing and query logic, keeping concerns separated and the codebase maintainable. Configuration files in a .streamlit directory control the visual appearance and deployment settings. A .gitignore file prevents sensitive data and generated files from being committed to version control. The manuals directory holds the actual PDF files, and a vector_db directory stores the processed, searchable version of the content.
The document processing pipeline proved to be the most technically challenging component. Aviation manuals have complex hierarchical structures with nested sections, tables, figures, and references. A naive approach of just extracting all text would lose this structure and create nonsensical chunks that split mid-paragraph or separate related content. My chunking algorithm walks through each page of a PDF, detecting chapter headers with regular expressions that match patterns like "Chapter 3" or "CHAPTER 3", detecting section numbers like "3.2.1" or "3.2.1.5", and creating new chunks at section boundaries while maintaining metadata about which chapter and section each chunk belongs to. Chunks are sized to approximately 800 words, which empirically balances having enough context in each chunk against not overwhelming the model with excessive text. Each chunk stores its contents and metadata indicating the source manual, chapter number, section number, and page number so we can provide precise citations in responses.
The PDF text extraction initially used PyPDF2, a lightweight Python library that's been around for years. It works reliably for straightforward text extraction but has known limitations with complex layouts, tables, and figures. This is actually a fundamental limitation of PDF as a format: it's designed for visual presentation, not semantic structure. A PDF doesn't store "this is a table" or "this is a header"; it stores coordinates and glyphs. Text extraction libraries have to reverse-engineer the semantic structure from the visual layout, which is an inherently imperfect process.
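The boundary detection described above can be sketched roughly as follows. This is a minimal illustration, not the actual pipeline: the regular expressions, the Chunk fields, and the chunk_pages function are all hypothetical names, and the input is assumed to be a list of already-extracted page texts.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns; real manuals will likely need more variants.
CHAPTER_RE = re.compile(r"^\s*chapter\s+(\d+)\b", re.IGNORECASE)
SECTION_RE = re.compile(r"^\s*(\d+(?:\.\d+){1,3})\s+\S")


@dataclass
class Chunk:
    manual: str
    chapter: str
    section: str
    page: int  # page on which the chunk starts
    text: str


def chunk_pages(manual, pages, max_words=800):
    """Split page texts into chunks at chapter/section headers or a word limit."""
    chunks = []
    chapter, section = "", ""
    buffer, start_page = [], 1

    def flush():
        # Close the current chunk under the metadata that was active for it.
        text = "\n".join(buffer).strip()
        if text:
            chunks.append(Chunk(manual, chapter, section, start_page, text))
        buffer.clear()

    for page_no, page_text in enumerate(pages, start=1):
        for line in page_text.splitlines():
            chap = CHAPTER_RE.match(line)
            sec = SECTION_RE.match(line)
            if chap or sec:
                flush()
                if chap:
                    chapter, section = chap.group(1), ""
                else:
                    section = sec.group(1)
            elif sum(len(item.split()) for item in buffer) >= max_words:
                flush()  # long section: split once the word budget is reached
            if not buffer:
                start_page = page_no
            buffer.append(line)
    flush()
    return chunks
```

Note that a body line that happens to start with a decimal number (for example "45.5 kg of fuel") would be mistaken for a section header by this simple pattern, so a real implementation would need stricter matching.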
For the first iteration, I accepted this limitation. Most queries about procedures, requirements, and textual content work perfectly well. Tables and figures are a known gap that I'll address in future revisions.
The vector database uses ChromaDB, an open-source embedding database. The key concept here is that traditional databases search by exact matching or simple pattern matching, but vector databases enable semantic search. Each chunk of text gets converted into a high-dimensional vector (essentially a list of numbers) that represents its meaning in a way that similar concepts are close together in vector space. When you query "What are the fuel reserve requirements", the system converts your query into a vector and finds the chunks whose vectors are closest, which correspond to chunks that are semantically similar even if they don't contain the exact words "fuel reserve requirements". This is why it can find relevant information even when queries are phrased differently than the manual text.
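To make "closest in vector space" concrete, here is a toy sketch in plain Python. The three-dimensional vectors are invented for illustration (real embeddings have hundreds of dimensions), and cosine similarity is only one common closeness measure; which metric ChromaDB applies depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks most similar to the query vector."""
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in ranked[:k]]


# Made-up vectors standing in for real embeddings.
chunk_vecs = {
    "fuel_reserves": [0.9, 0.1, 0.0],
    "duty_time": [0.0, 0.2, 0.9],
    "alternate_fuel": [0.8, 0.3, 0.1],
}
query_vec = [1.0, 0.2, 0.0]  # stands in for "What are the fuel reserve requirements"

print(top_k(query_vec, chunk_vecs))  # ['fuel_reserves', 'alternate_fuel']
```

The duty time chunk points in a different direction from the query, so it ranks last even though no keyword comparison took place.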
ChromaDB runs in persistent mode, storing the vector database on disk so it survives application restarts. This is critical for the Streamlit deployment where the app can be paused or restarted by the hosting platform. The database gets created and populated the first time you process manuals, and subsequent restarts just load the existing database rather than reprocessing everything. The processing pipeline is intelligent enough to detect when PDFs have been modified by comparing file hashes, so if you upload an updated revision of a manual, only that manual gets reprocessed rather than everything.
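The hash comparison can be sketched as below. This is an assumption about how such a check might look rather than the actual code: the manifest file, the function names, and the choice of SHA-256 are all illustrative.

```python
import hashlib
import json
from pathlib import Path


def file_hash(path):
    """SHA-256 of a file, read in blocks so large PDFs are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def find_changed_manuals(manuals_dir, manifest_path):
    """Return (names of new or modified PDFs, current name-to-hash mapping)."""
    manifest_path = Path(manifest_path)
    known = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    current = {p.name: file_hash(p) for p in sorted(Path(manuals_dir).glob("*.pdf"))}
    changed = [name for name, digest in current.items() if known.get(name) != digest]
    return changed, current


def save_manifest(manifest_path, current):
    """Record the hashes once the changed manuals have been reprocessed."""
    Path(manifest_path).write_text(json.dumps(current, indent=2))
```

Only the manuals returned in the first list would need reprocessing; the manifest is saved afterwards so the next run sees them as unchanged.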
The model integration uses Google's API through their official Python SDK. I initially attempted to use Anthropic's API since I'm familiar with it, but the subscription model and per-token pricing didn't align well with an operational tool that might see burst usage during audits or irregular query patterns. Google's API provides effectively unlimited queries within reasonable professional usage, which removes any psychological barrier to actually using the system. The model I settled on is the Pro, which provides the right balance of capability, speed, and API stability. The newer experimental models have additional features but the API is less stable, which is unacceptable for an operational tool.
The instructions to the API required careful calibration. The system prompt establishes the model's role as an aviation operations assistant for my company. It instructs the model to cite specific manual sections with page numbers, explain relationships between different sections when information is distributed across multiple parts of the documentation, note regulatory references from ICAO or the MCARs, be explicit when information isn't fully covered in the provided excerpts, be especially precise for safety-critical information, and distinguish between procedures for different aircraft types when applicable. The user query and retrieved manual excerpts get formatted into a structured prompt that makes it clear which content is the source material and which is the question being asked.
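One way to keep source material and question clearly separated is sketched below. The wording of SYSTEM_PROMPT, the excerpt dictionary keys, and the build_prompt function are hypothetical; they paraphrase the instructions described above rather than reproduce the real prompt.

```python
# Illustrative wording only; the real system prompt is not shown in this post.
SYSTEM_PROMPT = (
    "You are an aviation operations assistant. "
    "Cite manual sections with page numbers. "
    "Say so explicitly when the excerpts do not fully answer the question. "
    "Be especially precise with safety-critical information."
)


def build_prompt(question, excerpts):
    """Format retrieved excerpts and the user question into labeled blocks."""
    parts = ["MANUAL EXCERPTS:"]
    for index, excerpt in enumerate(excerpts, start=1):
        header = (
            f"[{index}] {excerpt['manual']}, "
            f"Section {excerpt['section']}, page {excerpt['page']}"
        )
        parts.append(f"{header}\n{excerpt['text']}")
    parts.append(f"QUESTION:\n{question}")
    return "\n\n".join(parts)


excerpts = [
    {
        "manual": "OM Part A",
        "section": "3.2.1",
        "page": 45,
        "text": "Example excerpt text goes here.",
    }
]
print(build_prompt("What are the fuel reserve requirements?", excerpts))
```

Numbering each excerpt and repeating its manual, section, and page in the header gives the model the exact strings it needs to produce citations.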
The web interface uses Streamlit, which is essentially a Python framework that turns Python scripts into web applications without requiring knowledge of HTML, CSS, JavaScript, or web frameworks. You write Python code with special Streamlit functions like st.title, st.button, st.text_area, and Streamlit automatically generates the web interface. This is extraordinarily powerful for people who understand the problem domain and can write Python. The interface I built has a sidebar for manual management with a button to process or update manuals, statistics showing how many chunks are in the database and how many manuals are loaded, and a list of which manuals are currently loaded. The main area has a text input for queries, a slider to control how many source chunks to retrieve, a search button, and display areas for the model's answer and the source references.
The deployment to Streamlit Cloud is remarkably straightforward for what it accomplishes. You connect your GitHub account to Streamlit, point it at your repository, specify which Python file is the main application, add any secrets like API keys through their web interface, and click deploy. Streamlit clones your repository, sets up a Python environment, installs dependencies from requirements.txt, and starts your application. The entire process takes three to five minutes. When you push updates to GitHub, Streamlit detects the changes and automatically redeploys. This continuous deployment means the development workflow is just edit code locally, commit to Git, push to GitHub, and the live application updates automatically.
The secrets management through Streamlit's interface is crucial for security. The Google API key never gets committed to the Git repository where it could be exposed. Instead, it's stored in Streamlit's encrypted secrets storage and injected into the application as an environment variable at runtime.
The debugging process revealed several interesting technical challenges. I initially used an earlier model version, which worked in some contexts but wasn't supported in the stable API version. Changing it to a later version resolved the problem. The next issue was the file path for the manuals directory. The application was looking for "./manuals", which works on my local machine, but the working directory on Streamlit Cloud is different. Changing to Path(__file__).parent / "manuals" made it relative to the application file's location rather than the working directory. The Git workflow issues were entirely about understanding Git's model.
The performance characteristics are interesting to analyze. Initial processing of all manuals took approximately fifteen to twenty minutes for about eighteen documents totaling several thousand pages. This happens once and then the database persists. Adding or updating a single manual takes one to three minutes depending on size. Query response time is typically three to five seconds from clicking search to seeing results, which breaks down to about one second for vector database search, two to four seconds for the model to process the context and generate a response, and negligible time for rendering the interface. For an operational tool, this latency is entirely acceptable.
The citation system provides proper attribution, which is essential for aviation operations where regulatory compliance depends on being able to show which approved document a procedure comes from. Each answer includes references formatted as "According to OM Part A, Section 3.2.1, page 45..." and below the answer there's an expandable list of all sources that were used, showing the manual name, chapter, section, and page for each. This means when an auditor asks how we determined a particular interpretation of a regulation, we can show them exactly which manuals were consulted and where to verify the information.
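Building the source list from chunk metadata might look something like this. The metadata keys and both function names are assumptions made for the sketch.

```python
def format_citation(meta):
    """Render one chunk's metadata as a human-readable source reference."""
    return (
        f"{meta['manual']}, Chapter {meta['chapter']}, "
        f"Section {meta['section']}, page {meta['page']}"
    )


def unique_sources(metas):
    """List citations in retrieval order, dropping exact duplicates."""
    seen, sources = set(), []
    for meta in metas:
        citation = format_citation(meta)
        if citation not in seen:
            seen.add(citation)
            sources.append(citation)
    return sources


retrieved = [
    {"manual": "OM Part A", "chapter": "3", "section": "3.2.1", "page": 45},
    {"manual": "OM Part A", "chapter": "3", "section": "3.2.1", "page": 45},
    {"manual": "OM Part B", "chapter": "5", "section": "5.1", "page": 12},
]
for source in unique_sources(retrieved):
    print(source)
```

Deduplicating matters because a long section split into several chunks can be retrieved more than once for the same query.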
The system handles cross-referencing between manuals naturally because it searches across all loaded documents simultaneously. A query like "Compare ATR and A320 fuel reserve requirements" will retrieve relevant chunks from both the ATR OM-B and the A320 OM-B, and the model synthesizes the comparison. This kind of cross-manual analysis is tedious to do manually but happens automatically here.
The known limitations are important to acknowledge. Table extraction from PDFs is unreliable with the current implementation. PyPDF2 sees tables as unstructured text and often mangles the layout. I attempted to upgrade to more sophisticated libraries like PyMuPDF and pdfplumber which have better table detection, but they introduced dependency conflicts in the Streamlit Cloud environment. The proper solution is to leverage a model's native PDF processing capability where you upload the actual PDF pages and it reads them as images, understanding tables and figures visually rather than through text extraction. This is technically feasible but represents a significant architectural change that I'll implement in a future iteration. For now, queries about tabular data may not find the information even though it exists in the manuals.
The system doesn't currently remember conversation history between queries. Each query is independent. If you ask "What are the duty time limits?" and then "What about for night operations?", the second query doesn't know it's a follow-up about duty time limits. Adding conversation context would require maintaining session state and including previous queries and responses in the prompt, which increases token usage and complexity. For operational use, independent queries are actually often preferable because each answer is self-contained with full citations rather than relying on implicit context.
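For reference, the conversation context described above could be as small as the following sketch. It is not part of the current system; the ConversationMemory class and its method names are hypothetical, and capping the number of remembered turns is one simple way to bound the extra token usage.

```python
from collections import deque


class ConversationMemory:
    """Keep the last few question/answer pairs for follow-up queries."""

    def __init__(self, max_turns=3):
        # deque with maxlen silently discards the oldest turn when full.
        self.turns = deque(maxlen=max_turns)

    def add(self, question, answer):
        self.turns.append((question, answer))

    def with_history(self, question):
        """Prefix the new question with the remembered turns."""
        lines = []
        for previous_question, previous_answer in self.turns:
            lines.append(f"Previous question: {previous_question}")
            lines.append(f"Previous answer: {previous_answer}")
        lines.append(f"Current question: {question}")
        return "\n".join(lines)


memory = ConversationMemory(max_turns=2)
memory.add("What are the duty time limits?", "Example answer with citations.")
print(memory.with_history("What about for night operations?"))
```

With the earlier question included, "What about for night operations?" carries enough context for both retrieval and generation to treat it as a duty time question.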
The chunk size of 800 words was chosen empirically but represents a tradeoff. Smaller chunks mean more precise retrieval but potentially missing context that spans longer sections. Larger chunks mean more context but potentially diluting relevance with tangential information. The section-boundary detection helps by preferring to break at natural boundaries rather than mid-paragraph, but long sections may still get split. A more sophisticated approach might use hierarchical chunking where you have detailed chunks for specific information but also broader chunks for context, but the current approach works well enough for operational needs.
The question of data privacy and security is straightforward because the deployment model keeps everything under our control. The manuals are stored in a private GitHub repository. The Streamlit application is deployed with authentication requirements. The vector database is stored on Streamlit's infrastructure and is not shared with other applications. The API sees the manual excerpts we send in queries but doesn't retain them. For truly sensitive content, we could deploy this entire stack on-premises or in our own cloud infrastructure, but the risk profile for operations manuals that are already distributed to flight crews doesn't warrant that complexity.
The extensibility of this architecture is worth considering for future development. Adding more manuals is trivial - just copy PDFs to the manuals directory, commit to Git, and reprocess. The system will detect new files and process them automatically. Changing the model is a single line of code in utils.py if we want to try different versions or add fallback models. The vector database is model-agnostic so we're not locked into any particular service provider. The search parameters like number of chunks to retrieve and chunk size are easily adjustable. The prompt engineering for how the model responds can be continuously refined without touching any infrastructure.
Potential enhancements I'm considering include implementing conversation memory so the system can handle follow-up questions, adding the ability to explicitly specify which manuals to search for targeted queries, implementing different retrieval strategies for different query types - simple factual questions versus complex analytical questions versus comparison requests, adding a feedback mechanism where users can indicate whether answers were helpful which could tune the retrieval parameters over time, implementing role-based access where different users see different manuals based on their position, adding the ability to handle manufacturer service bulletins and airworthiness directives which are time-sensitive additions to the knowledge base, and integrating with our existing operations management systems so relevant manual references automatically populate flight plans or crew briefings.
The development of this system took approximately one week. The initial phase was environment setup, understanding the architecture, and getting a basic prototype working with a single manual. Next was debugging the various deployment issues, adding all the manuals, and tuning the prompt engineering. Ongoing maintenance is essentially zero: once deployed, it runs indefinitely until we need to add or update manuals.
The choice to build this rather than use an off-the-shelf solution was driven by several factors. Commercial document search products exist, but they're either expensive enterprise systems that are overkill for our needs, consumer products that can't handle the semantic complexity of technical aviation documentation, or services that don't provide the citation accuracy required for regulatory compliance. Now, I control the data, the instructions, the deployment, and the cost structure. I understand exactly how it works, which matters when explaining to regulators or auditors how we're using this in operations. And I can extend it to integrate with other company systems without vendor dependencies.
The broader lesson here is that the gap between "idea for an operational tool" and "deployed production system" has collapsed dramatically in the past few years. The limiting factor now is understanding the problem well enough to architect a solution and having the patience to debug the inevitable issues. Modern language models have made it possible for domain experts to build sophisticated applications without becoming machine learning specialists. Frameworks like Streamlit and deployment platforms like Streamlit Cloud have made it possible to go from Python script to web application without becoming a DevOps engineer. The open source ecosystem provides battle-tested libraries for every component. What used to require a team of specialists and months of development can now be accomplished by a single knowledgeable person in days.
This project represents a category of tools I expect to see proliferate across aviation operations and other regulated industries. The combination of extensive documentation requirements, need for rapid access to accurate information, and recent advances in technology creates obvious use cases. Every airline, every maintenance organization, every training department, every regulatory authority has the same fundamental problem of too much documentation and not enough time to search it effectively. The technical barriers to solving this have essentially disappeared. What remains is the organizational will to experiment and the technical literacy to implement.
The next phase of development for this system will focus on the table and figure extraction problem. The cleanest solution is to modify the query pipeline so that when certain patterns are detected in questions (mentions of tables, figures, charts, specific numeric comparisons) the system retrieves the chunks and the actual PDF pages containing those chunks, sends those pages to the model as images, and gets a response that can see the visual layout. This would perfectly handle tables, figures, diagrams, and any other visual content without trying to extract text from them. The tradeoff is increased API usage and slightly longer query times, but for the subset of queries that actually need this capability, it's worthwhile.
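The pattern detection step could start from a simple keyword check like the one below. The keyword list and the needs_page_images function are illustrative assumptions; detecting "specific numeric comparisons" would need more than this and is left out of the sketch.

```python
import re

# Hypothetical keyword list; word boundaries avoid matching "comfortable".
VISUAL_HINTS = re.compile(
    r"\b(?:tables?|figures?|charts?|graphs?|diagrams?)\b",
    re.IGNORECASE,
)


def needs_page_images(question):
    """Guess whether a question refers to visual content such as a table."""
    return bool(VISUAL_HINTS.search(question))


for question in (
    "What does Table 4-2 say about landing distance?",
    "What are the duty time limits?",
):
    route = "page images" if needs_page_images(question) else "text chunks only"
    print(f"{question} -> {route}")
```

Questions that match would take the slower image-based path, while everything else stays on the existing text-only path, which keeps the extra API usage limited to queries that need it.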
I'm also considering adding a batch processing mode where I can submit a list of common questions and have the system generate a FAQ document with properly cited answers from the manuals. This would be useful for crew briefings, audit preparation, or documentation of standard interpretations of procedures. The underlying technology is the same; it's just automating the query process and formatting the outputs differently.
The regulatory implications of using such technology in flight operations are something I'm monitoring closely. EASA, FAA and other authorities are still developing guidance on modern technology use in safety-critical contexts. This particular application is clearly in the category of decision support rather than autonomous decision-making: it helps humans find information faster but doesn't make operational decisions. The citations and source tracking mean we maintain the same documentary evidence we'd have if someone manually searched the PDFs. But as technological capabilities advance and integration deepens, the regulatory frameworks will need to evolve. Being early in this space means we'll help inform those frameworks through our experience.