From 541975bdfa649a95b9a1eea24d4828deb2591929 Mon Sep 17 00:00:00 2001 From: Jo Stichbury Date: Thu, 27 Nov 2025 13:34:32 +0000 Subject: [PATCH 1/8] Create KedroCyberpunkBlogPost.md --- .../post/KedroCyberpunkBlogPost.md | 467 ++++++++++++++++++ 1 file changed, 467 insertions(+) create mode 100644 kedro-cyberpunk-knowledge-base/post/KedroCyberpunkBlogPost.md diff --git a/kedro-cyberpunk-knowledge-base/post/KedroCyberpunkBlogPost.md b/kedro-cyberpunk-knowledge-base/post/KedroCyberpunkBlogPost.md new file mode 100644 index 00000000..21e9eae3 --- /dev/null +++ b/kedro-cyberpunk-knowledge-base/post/KedroCyberpunkBlogPost.md @@ -0,0 +1,467 @@ +# Building a Cyberpunk 2077 Knowledge Base with Kedro and LangChain + +There's a well-known adage about writing that tells people to "write what they know." When I had to create a project to test an experimental Kedro dataset for loading LangChain prompt templates, I decided to take that advice to heart. + +I embarked upon the nerdy endeavor of building an LLM-powered question-answering knowledge base whose sole purpose is to accurately answer questions about the action role-playing game Cyberpunk 2077. With over 400 hours of gameplay, every achievement unlocked, and more than a few passionate discussions (read: heated arguments on Reddit) about the game under my belt, this would be the perfect test subject. I could easily spot inaccurate responses, hallucinations, or any other LLM quirks that might slip through. + +To my pleasant surprise, this project would evolve to become a valuable learning experience in building data pipelines with Kedro, wrestling with LLM limitations, and discovering that sometimes the best solutions come from working within constraints rather than around them. + +## The Project + +At its core, this is a Retrieval-Augmented Generation (RAG) system built with Kedro. The initial goal was to take a full transcript of a Cyberpunk 2077 playthrough (over 400 pages of dialog), make it searchable, and use it to answer questions accurately. I could only select a specific playthrough on a blog that transcribes games (linked below), and this was a challenge seeing as the game itself has multiple different endings based on the player's choices throughout the story. + +## The Transcript + +The `LangChainPromptDataset` was built to seamlessly integrate LangChain `PromptTemplate` objects into Kedro pipelines, allowing prompts to be loaded as raw data files and reducing boilerplate code. For a proper field test, I wanted to use it with a real LLM query workflow, not just unit tests or mock responses. + +### Let's Talk About Chunking... + +When I started, the first issue I encountered was that LLMs have token limits. You can't just dump 400 pages of transcript into a prompt and expect it to work. The transcript needed to be broken down into manageable chunks. 
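A quick back-of-envelope estimate shows the scale of the problem. The per-page figures below are assumptions for illustration, not measurements from the transcript:

```python
# Rough token estimate for the full transcript (assumed figures, for illustration only).
pages = 400
words_per_page = 400        # assumption: a dense page of dialog
tokens_per_word = 1.3       # common rule of thumb for English text
estimated_tokens = int(pages * words_per_page * tokens_per_word)
print(f"~{estimated_tokens:,} tokens")  # ~208,000 tokens, far more than a single prompt can hold
```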
+ +I started by creating a `process_transcript` Kedro pipeline that handles this transformation: + +```python +def chunk_transcript( + transcript: str, chunk_size: int = 1000, overlap: int = 200 +) -> List[Dict[str, Any]]: + """Split transcript into overlapping chunks for context retrieval.""" + cleaned_transcript = re.sub(r"\n+", "\n", transcript.strip()) + sentences = re.split(r"(?<=[.!?])\s+", cleaned_transcript) + + chunks = [] + start_idx = 0 + + while start_idx < len(sentences): + end_idx = min(start_idx + chunk_size, len(sentences)) + chunk_text = " ".join(sentences[start_idx:end_idx]) + + chunks.append({ + "text": chunk_text, + "chunk_id": len(chunks), + "start_sentence": start_idx, + "end_sentence": end_idx - 1, + }) + + # Move forward with overlap to preserve context + start_idx = max(start_idx + chunk_size - overlap, start_idx + 1) + + return chunks +``` + +The overlap is crucial as it ensures that context spanning chunk boundaries isn't lost. This is a common pattern in RAG systems, but implementing it as a Kedro node made it easy to experiment with different chunk sizes and overlap values through the pipeline `parameters` feature. + +Then I built the project into two separate pipelines: + +1. **`process_transcript`**: Processes raw data once (expensive operation) +2. **`query_pipeline`**: Runs queries repeatedly (cheap operation) + +Kedro's pipeline structure made iteration easier. This separation meant I could process the data once, store it as a Kedro dataset, and then query it as many times as I wanted without reprocessing. The data catalog handles all the file I/O, so storage and loading is straightforward. + +### Chippin' in: The Challenges After Chunking + +The transcript itself led to four fundamental limitations: + +1. **Dialogue-only content**: The 400-page transcript contained only dialogue, missing crucial narrative context about what's actually happening in scenes. + +2. **Single playthrough bias**: As I've mentioned earlier, Cyberpunk 2077 is a game where player choices dramatically alter the story. This transcript was from only one specific playthrough, so it didn't have information about alternative paths or endings. + +3. **The hallucination problem**: When I instructed the LLM to strictly use only the provided context with a low temperature, it would often respond with "I don't have sufficient information." But if I allowed it more freedom or increased the temperature, it would confidently spout off misinformation about the game. + +4. **Naive keyword matching**: My initial approach of using simple keyword matching to find relevant chunks was inadequate. Sometimes completely unrelated chunks would be selected, leading to nonsensical responses about other plot points or characters. What a gonk. + +### Preem Solutions + +Kedro's node-based architecture made it trivial to experiment with different approaches. Each solution became a new node or a modification to an existing one. + +**Solution 1: PartitionedDataset for Better Retrieval** + +I initially stored chunks as individual JSON files. 
Switching to Kedro's `PartitionedDataset` made retrieval more efficient and the code cleaner: + +```python +def partition_transcript_chunks( + chunks: List[Dict[str, Any]], +) -> Dict[str, Dict[str, Any]]: + """Convert chunks into partition mapping for Kedro's PartitionedDataset.""" + partitions: Dict[str, Dict[str, Any]] = {} + for chunk in chunks: + chunk_id = chunk.get("chunk_id", len(partitions)) + partition_key = f"chunk_{chunk_id}" + partitions[partition_key] = chunk + return partitions +``` + +This change alone improved response quality because the partitioned structure made it easier to search and retrieve specific chunks. + +**Solution 2: Character Name Extraction** + +The transcript's "Character: Dialogue" structure made it easy to extract character names: + +```python +def extract_characters(transcript: str) -> List[str]: + """Extract unique character names from transcript (format: "Name: Dialogue").""" + character_pattern = r"^([A-Za-z\s]+):" + characters = set() + + for line in transcript.split("\n"): + match = re.match(character_pattern, line.strip()) + if match: + character_name = match.group(1).strip() + if character_name and len(character_name) > 1: + characters.add(character_name) + + return sorted(list(characters)) +``` + +I used this character list to boost the relevance score of transcript chunks when a query mentioned a character name. This simple heuristic significantly improved results for character-related questions. + +**Solution 3: Semantic Similarity with Sentence Transformers** + +The most notable improvement came when I replaced keyword matching with semantic similarity search using Sentence Transformers: + +```python +def find_relevant_contexts( + query: str, + transcript_chunks: Dict[str, Any], + wiki_embeddings: Dict[str, Dict[str, Any]], + character_list: List[str], + embedding_model_name: str = "all-MiniLM-L6-v2", + max_chunks: int = 5, + character_bonus: float = 0.05, + wiki_weight: float = 0.7, +) -> List[Dict[str, Any]]: + """Retrieve top relevant contexts using semantic similarity.""" + model = get_embedding_model(embedding_model_name) + query_emb = model.encode(query, convert_to_tensor=True) + + results = [] + # Calculate similarity for transcript chunks and wiki pages... + # Apply character bonus and wiki weight... + + results.sort(key=lambda x: x[0], reverse=True) + return [ + {"source": src, "text": txt, "similarity": sim} + for sim, src, txt in results[:max_chunks] + ] +``` + +This change made response quality significantly better. The model could now understand that "Who is Johnny Silverhand?" and "Tell me about the guy who blew up Arasaka Tower" were asking for similar information, even without exact keyword matches. + +However, even with these improvements, the transcript alone was insufficient. Questions about characters worked reasonably well, but questions about game mechanics, missions, or world-building fell flat. The data simply wasn't there. I needed a better data source. + +## The Wiki + +I needed a source of data that was complete, up-to-date, reliable, and neutral because people get *very* passionate about their in-game choices. The community-maintained Cyberpunk Wiki was the obvious answer. + +### The Download Challenge + +The wiki is substantial as it's about 15,000 pages. Downloading it required: + +1. **Respecting API rate limits**: I had to add intentional pauses between requests to avoid getting blocked. The download script took a couple of hours to complete. + +2. 
**Choosing the right format**: I settled on a single JSON file where each page is a key-value pair: + ```json + { + "Johnny Silverhand": "Johnny Silverhand is a...", + "Arasaka Tower": "Arasaka Tower is located in...", + ... + } + ``` + This format made it easy to process in Kedro and convert to embeddings. + +3. **Data cleanup**: A significant portion of the work was cleaning the data: + - Removing redirect pages + - Stripping markdown syntax + - Removing image tags, external links, and language links + - Cleaning up formatting artifacts + +I wrote a separate Python script for this cleanup, which was a one-time operation. The cleaned data went into the `data/raw/` directory, ready for Kedro to process. + +### Kedro-Specific Challenges + +This project was already testing one new dataset (`LangChainPromptDataset`), and I wanted to see how far I could push Kedro's built-in datasets before needing external tools. + +The "proper" solution for storing embeddings would be a vector database like Pinecone or Weaviate. But that would require: +1. Writing another custom dataset to interface with the database +2. Setting up and managing external infrastructure +3. Adding complexity to the project + +I decided to test Kedro's limits instead. Could I build a functional RAG system using only Kedro's built-in datasets? + +I stored the cleaned wiki data (about 13,000 entries) in a `JSONDataset`, then added a node to the `process_transcript` pipeline to generate embeddings: + +```python +def embed_wiki_pages( + wiki_data: Dict[str, str], + embedding_model_name: str = "all-MiniLM-L6-v2" +) -> Dict[str, Dict[str, Any]]: + """Generate embeddings for wiki pages using SentenceTransformer.""" + model = get_embedding_model(embedding_model_name) + embedded_pages: Dict[str, Dict[str, Any]] = {} + + for title, text in tqdm(wiki_data.items()): + if not text.strip(): + continue + embedding = model.encode(text, convert_to_numpy=True) + embedded_pages[title] = {"text": text, "embedding": embedding} + + return embedded_pages +``` + +Then I stored these embeddings in a `PickleDataset`: + +```yaml +wiki_embeddings: + type: pickle.PickleDataset + filepath: data/processed/wiki_embeddings.pkl +``` + +This approach has trade-offs: +- ✅ No external dependencies +- ✅ Fast retrieval (everything in memory) +- ✅ No type conversion needed +- ❌ Not scalable for very large datasets +- ❌ Entire dataset loads into memory + +For 13,000 pages (about 11 megabytes of data), I decided, this was a perfectly reasonable trade-off. The embeddings load quickly, and the similarity search is fast enough for interactive use. It's not production-ready for millions of documents, but it proves that Kedro's built-in tools can handle non-trivial workloads well enough. + +### The Results + +Integrating wiki embeddings into the context retrieval node significantly improved the quality of the output. The system could now answer questions about: +- Characters and locations +- Narrative events and mission outcomes +- Game mechanics like weapons, systems, skills, or vehicles +- World-building and lore + +The prompt engineering also became more effective. With better data, I could confidently instruct the LLM to use only the provided context: + +```json +{ + "role": "system", + "content": "You are an expert in Cyberpunk 2077... Using ONLY the information provided from in-game transcripts and wiki entries, answer the user's questions. Do NOT include any information that is not present in the provided context." 
+} +``` + +Previously, this strict instruction led to many "I don't have sufficient data" responses. Now, with comprehensive wiki data, the LLM had enough context to provide accurate, detailed answers while staying within the provided materials. + +## Making It a Conversation + +At this point, I had a working RAG system, but using it was clunky. Each query required: + +1. Starting a new Kedro session +2. Running the entire pipeline +3. Passing the query as a runtime parameter: `kedro run --pipeline=query_pipeline --params=user_query="Who is Johnny Silverhand?"` + +This was fine for testing, but not very nice for actual use. I wanted something more convenient and user-friendly. + +### The Loop Solution + +Kedro nodes are just Python functions. There's nothing stopping you from putting a loop inside a node: + +```python +def query_llm_cli( + transcript_chunks: Dict[str, Any] = None, + wiki_embeddings: Dict[str, Dict[str, Any]] = None, + character_list: List[str] = None, + # ... other parameters ... +) -> None: + """Interactive conversation loop for CLI chatbot.""" + api_key = get_openai_api_key() + llm = get_llm(api_key=api_key, model=llm_model_name, temperature=llm_temperature) + conversation_history: List[Any] = [] + + while True: + user_query = input("🟢 You: ").strip() + if user_query.lower() in {"exit", "quit"}: + break + + # Find relevant contexts and format prompt + contexts = find_relevant_contexts(query=user_query, ...) + new_messages = format_prompt_with_context(...) + + conversation_history.extend(new_messages) + response = llm.invoke(conversation_history) + + print(f"\n⚪ LLM: {response.content}\n") + conversation_history.append({"role": "ai", "content": response.content}) +``` + +This approach leverages LangChain's `ChatPromptTemplate` (loaded via our `LangChainPromptDataset`) to maintain conversation history. The chatbot now has memory of previous exchanges, making the interaction feel natural and conversational. + +## Games Belong on Discord + +As a stretch goal, I wanted to make this a Discord bot. It seemed fitting. A gaming knowledge base should live where people game. It also brought some interesting insights from an architecture perspective. + +### The Async Challenge + +To get my Kedro runs to interact with Discord, I used Discord.py, an open source Python API wrapper for Discord. + +Discord.py is built on asyncio. Kedro pipeline runs are blocking operations. These two paradigms don't play well together. + +**Solution: Bootstrap and Thread** + +Each Discord command bootstraps its own Kedro session and runs the pipeline in a separate thread: + +```python +def setup_kedro_project() -> Path: + """Bootstrap and configure Kedro project.""" + project_path = Path(__file__).resolve().parent + metadata = bootstrap_project(project_path) + configure_project(metadata.package_name) + return project_path + +@bot.command(name="/query") +async def run_query(ctx: commands.Context, *, user_query: str) -> None: + """Run Kedro query pipeline asynchronously.""" + await ctx.send(f"🚀 Running Kedro pipeline for query: `{user_query}`...") + + project_path = setup_kedro_project() + + def run_kedro() -> Any: + with KedroSession.create( + project_path=project_path, + runtime_params={"user_query": user_query} + ) as session: + return session.run(pipeline_name="query_pipeline", tags=["discord"]) + + result = await asyncio.to_thread(run_kedro) + # ... process and send response ... 
+``` + +This approach has a nice side effect: each query runs in its own pipeline execution, so multiple users can query the bot simultaneously without interfering with each other. + +### The Message Length Problem + +Discord has a 2000-character limit per message. LLM responses can easily exceed this. The solution was a simple chunking function: + +```python +DISCORD_MAX_MESSAGE_LENGTH = 2000 +DISCORD_SAFE_MESSAGE_LENGTH = 1900 + +async def send_long_message(ctx: commands.Context, message: str) -> None: + """Send message to Discord, chunking if it exceeds 2000 characters.""" + if len(message) <= DISCORD_SAFE_MESSAGE_LENGTH: + await ctx.send(message) + else: + for i in range(0, len(message), DISCORD_MAX_MESSAGE_LENGTH): + await ctx.send(message[i:i+DISCORD_MAX_MESSAGE_LENGTH]) +``` + +### The Pipeline Architecture Challenge + +Here's where I got some interesting insights about Kedro pipeline design. The Discord bot and CLI chatbot needed different behaviors: + +- **CLI**: Interactive loop, maintains conversation history +- **Discord**: Single query, no history, returns string response + +**Attempt 1: Duplicate Pipelines** + +My first instinct was to create separate pipelines. This worked but violated DRY principles and Kedro best practices. Not ideal. In fact, Kedro does not even allow nodes to have the same name even if they're in different pipelines. The framework that exists to apply proper software development practices to data projects was doing its job. + +**Attempt 2: Separate Pipelines with Tags** + +I tried splitting into three pipelines: +1. Context retrieval and prompt assembly (shared) +2. CLI LLM query node +3. Discord LLM query node + +The idea was to use tags to control execution. This failed because I couldn't guarantee execution order—sometimes the LLM node would run before context retrieval, leading to hallucinations from empty context. + +**Solution: Single Pipeline with Tagged Nodes** + +The chosen approach was a single pipeline with all nodes, using tags to select execution paths: + +```python +def create_pipeline() -> Pipeline: + return Pipeline([ + Node( + func=find_relevant_contexts, + inputs=[...], + outputs="relevant_contexts", + tags=["cli", "discord"], # Shared node + ), + Node( + func=format_prompt_with_context, + inputs=[...], + outputs="formatted_prompt", + tags=["cli", "discord"], # Shared node + ), + Node( + func=query_llm_cli, + inputs=[...], + outputs="llm_response_cli", + tags=["cli"], # CLI-only + ), + Node( + func=query_llm_discord, + inputs=[...], + outputs="llm_response_discord", + tags=["discord"], # Discord-only + ), + ]) +``` + +Now I can run: +- `kedro run --pipeline=query_pipeline --tags=cli` for CLI mode +- `kedro run --pipeline=query_pipeline --tags=discord` for Discord mode + +Kedro's tag system ensures only the appropriate nodes run, and execution order is guaranteed because nodes are connected by their inputs/outputs. This is the Kedro way: use the framework's features to solve problems elegantly. + +## Lessons Learned + +This project started with building a simple query system to test a new dataset. It ended up being a very insightful learning experience. + +### 1. Separation of Concerns Is Important + +Having separate pipelines for processing and querying meant I could iterate on query logic without reprocessing 400 pages of transcript and 13,000 wiki pages. This separation, enforced by Kedro's architecture, was a great time saver. + +### 2. The Data Catalog Is Your Friend + +Never hardcode file paths. 
The catalog makes it trivial to: +- Change data locations +- Use different datasets for testing vs. production +- Track data lineage + +### 3. Parameters Enable Experimentation + +Moving magic numbers to `parameters.yml` made it easy to experiment: +- What's the optimal chunk size? +- How much overlap is needed? +- What temperature works best for this use case? + +Change a parameter, rerun the pipeline. No code changes needed. + +### 4. Nodes Are Just Functions + +Don't overthink it. A node can be a simple transformation, a complex loop, or anything in between. Kedro provides structure, not restrictions. + +### 5. Tags Are Powerful + +The tag system solved a real architectural problem elegantly. It's a simple feature with powerful implications for pipeline organization. + +## Conclusion: From Experiment to Production Pattern + +What started as a test of an experimental dataset became a comprehensive exploration of building production-ready data pipelines with Kedro. The project demonstrates: + +- **RAG system architecture** using semantic search +- **Custom dataset integration** with LangChain +- **Pipeline organization** for different execution modes +- **Async integration** with blocking operations +- **Practical constraints** and trade-offs in real systems + +The code is clean, maintainable, and follows Kedro best practices. More importantly, it works. The bot can answer questions about Cyberpunk 2077, drawing from both the game transcript and comprehensive wiki data. + +And after 466 hours of gameplay and every achievement unlocked, I can confirm: the bot's answers are accurate. Now if only it could tell me when the sequel's release date is going to be. + +## Resources + +- [Kedro Documentation](https://docs.kedro.org/) - Comprehensive guide to Kedro +- [LangChain Documentation](https://python.langchain.com/) - LLM framework +- [Discord.py Documentation](https://discordpy.readthedocs.io/) - Discord bot library +- [Sentence Transformers](https://www.sbert.net/) - Semantic embeddings +- [OpenAI API](https://platform.openai.com/docs) - LLM provider +- [Cyberpunk Wiki](https://cyberpunk.fandom.com/wiki/Cyberpunk_Wiki) - Game information source +- [Full Cyberpunk 2077 Transcript](https://game-scripts-wiki.blogspot.com/2020/12/cyberpunk-2077-full-transcript.html) - Transcript source + +--- + +*The complete project is available on GitHub as part of the Kedro Academy repository. Feel free to explore, experiment, and adapt it for your own use cases.* From 2dfe82b4ba1f6942eb117fe44402592bcb6dbd2d Mon Sep 17 00:00:00 2001 From: Jo Stichbury Date: Thu, 27 Nov 2025 13:56:53 +0000 Subject: [PATCH 2/8] Update KedroCyberpunkBlogPost.md --- .../post/KedroCyberpunkBlogPost.md | 162 ++++++++++-------- 1 file changed, 87 insertions(+), 75 deletions(-) diff --git a/kedro-cyberpunk-knowledge-base/post/KedroCyberpunkBlogPost.md b/kedro-cyberpunk-knowledge-base/post/KedroCyberpunkBlogPost.md index 21e9eae3..da15a636 100644 --- a/kedro-cyberpunk-knowledge-base/post/KedroCyberpunkBlogPost.md +++ b/kedro-cyberpunk-knowledge-base/post/KedroCyberpunkBlogPost.md @@ -1,20 +1,20 @@ -# Building a Cyberpunk 2077 Knowledge Base with Kedro and LangChain +# Building a cyberpunk 2077 knowledge base with Kedro and LangChain -There's a well-known adage about writing that tells people to "write what they know." When I had to create a project to test an experimental Kedro dataset for loading LangChain prompt templates, I decided to take that advice to heart. 
+There's a well-known adage about writing that tells people to "write what they know." When I had to create a project to test an experimental Kedro dataset for loading LangChain prompt templates, I decided to take that advice to heart. -I embarked upon the nerdy endeavor of building an LLM-powered question-answering knowledge base whose sole purpose is to accurately answer questions about the action role-playing game Cyberpunk 2077. With over 400 hours of gameplay, every achievement unlocked, and more than a few passionate discussions (read: heated arguments on Reddit) about the game under my belt, this would be the perfect test subject. I could easily spot inaccurate responses, hallucinations, or any other LLM quirks that might slip through. +I embarked upon the nerdy endeavor of building an LLM-powered question-answering knowledge base whose sole purpose is to accurately answer questions about the action role-playing game Cyberpunk 2077. With over 400 hours of gameplay, every achievement unlocked, and more than a few passionate discussions (read: heated arguments on Reddit) about the game under my belt, this would be the perfect test subject. I could easily spot inaccurate responses, hallucinations, or any other LLM quirks that might slip through. To my pleasant surprise, this project would evolve to become a valuable learning experience in building data pipelines with Kedro, wrestling with LLM limitations, and discovering that sometimes the best solutions come from working within constraints rather than around them. -## The Project +## The project At its core, this is a Retrieval-Augmented Generation (RAG) system built with Kedro. The initial goal was to take a full transcript of a Cyberpunk 2077 playthrough (over 400 pages of dialog), make it searchable, and use it to answer questions accurately. I could only select a specific playthrough on a blog that transcribes games (linked below), and this was a challenge seeing as the game itself has multiple different endings based on the player's choices throughout the story. -## The Transcript +## The transcript The `LangChainPromptDataset` was built to seamlessly integrate LangChain `PromptTemplate` objects into Kedro pipelines, allowing prompts to be loaded as raw data files and reducing boilerplate code. For a proper field test, I wanted to use it with a real LLM query workflow, not just unit tests or mock responses. -### Let's Talk About Chunking... +### Let's talk about chunking... When I started, the first issue I encountered was that LLMs have token limits. You can't just dump 400 pages of transcript into a prompt and expect it to work. The transcript needed to be broken down into manageable chunks. @@ -55,25 +55,25 @@ Then I built the project into two separate pipelines: 1. **`process_transcript`**: Processes raw data once (expensive operation) 2. **`query_pipeline`**: Runs queries repeatedly (cheap operation) -Kedro's pipeline structure made iteration easier. This separation meant I could process the data once, store it as a Kedro dataset, and then query it as many times as I wanted without reprocessing. The data catalog handles all the file I/O, so storage and loading is straightforward. +Kedro's pipeline structure made iteration easier. This separation meant I could process the data once, store it as a Kedro dataset, and then query it as many times as I wanted without reprocessing. The data catalog handles all the file I/O, so storage and loading are straightforward. 
-### Chippin' in: The Challenges After Chunking +### Chippin' in: the challenges after chunking The transcript itself led to four fundamental limitations: -1. **Dialogue-only content**: The 400-page transcript contained only dialogue, missing crucial narrative context about what's actually happening in scenes. +1. **Dialog-only content**: The 400-page transcript contained only dialog, missing crucial narrative context about what's actually happening in scenes. 2. **Single playthrough bias**: As I've mentioned earlier, Cyberpunk 2077 is a game where player choices dramatically alter the story. This transcript was from only one specific playthrough, so it didn't have information about alternative paths or endings. -3. **The hallucination problem**: When I instructed the LLM to strictly use only the provided context with a low temperature, it would often respond with "I don't have sufficient information." But if I allowed it more freedom or increased the temperature, it would confidently spout off misinformation about the game. +3. **The hallucination problem**: When I instructed the LLM to strictly use only the provided context with a low temperature, it would often respond with "I don't have sufficient information." But if I allowed it more freedom or increased the temperature, it would confidently spout misinformation about the game. 4. **Naive keyword matching**: My initial approach of using simple keyword matching to find relevant chunks was inadequate. Sometimes completely unrelated chunks would be selected, leading to nonsensical responses about other plot points or characters. What a gonk. -### Preem Solutions +### Preem solutions Kedro's node-based architecture made it trivial to experiment with different approaches. Each solution became a new node or a modification to an existing one. -**Solution 1: PartitionedDataset for Better Retrieval** +**Solution 1: PartitionedDataset for better retrieval** I initially stored chunks as individual JSON files. Switching to Kedro's `PartitionedDataset` made retrieval more efficient and the code cleaner: @@ -92,7 +92,7 @@ def partition_transcript_chunks( This change alone improved response quality because the partitioned structure made it easier to search and retrieve specific chunks. -**Solution 2: Character Name Extraction** +**Solution 2: Character name extraction** The transcript's "Character: Dialogue" structure made it easy to extract character names: @@ -114,7 +114,7 @@ def extract_characters(transcript: str) -> List[str]: I used this character list to boost the relevance score of transcript chunks when a query mentioned a character name. This simple heuristic significantly improved results for character-related questions. -**Solution 3: Semantic Similarity with Sentence Transformers** +**Solution 3: Semantic similarity with Sentence Transformers** The most notable improvement came when I replaced keyword matching with semantic similarity search using Sentence Transformers: @@ -148,17 +148,18 @@ This change made response quality significantly better. The model could now unde However, even with these improvements, the transcript alone was insufficient. Questions about characters worked reasonably well, but questions about game mechanics, missions, or world-building fell flat. The data simply wasn't there. I needed a better data source. -## The Wiki +## The wiki I needed a source of data that was complete, up-to-date, reliable, and neutral because people get *very* passionate about their in-game choices. 
The community-maintained Cyberpunk Wiki was the obvious answer. -### The Download Challenge +### The download challenge -The wiki is substantial as it's about 15,000 pages. Downloading it required: +The wiki is substantial, as it's about 15,000 pages. Downloading it required: 1. **Respecting API rate limits**: I had to add intentional pauses between requests to avoid getting blocked. The download script took a couple of hours to complete. 2. **Choosing the right format**: I settled on a single JSON file where each page is a key-value pair: + ```json { "Johnny Silverhand": "Johnny Silverhand is a...", @@ -166,21 +167,24 @@ The wiki is substantial as it's about 15,000 pages. Downloading it required: ... } ``` + This format made it easy to process in Kedro and convert to embeddings. 3. **Data cleanup**: A significant portion of the work was cleaning the data: - - Removing redirect pages - - Stripping markdown syntax - - Removing image tags, external links, and language links - - Cleaning up formatting artifacts + + * Removing redirect pages + * Stripping Markdown syntax + * Removing image tags, external links, and language links + * Cleaning up formatting artifacts I wrote a separate Python script for this cleanup, which was a one-time operation. The cleaned data went into the `data/raw/` directory, ready for Kedro to process. -### Kedro-Specific Challenges +### Kedro-specific challenges This project was already testing one new dataset (`LangChainPromptDataset`), and I wanted to see how far I could push Kedro's built-in datasets before needing external tools. The "proper" solution for storing embeddings would be a vector database like Pinecone or Weaviate. But that would require: + 1. Writing another custom dataset to interface with the database 2. Setting up and managing external infrastructure 3. Adding complexity to the project @@ -216,21 +220,23 @@ wiki_embeddings: ``` This approach has trade-offs: -- ✅ No external dependencies -- ✅ Fast retrieval (everything in memory) -- ✅ No type conversion needed -- ❌ Not scalable for very large datasets -- ❌ Entire dataset loads into memory -For 13,000 pages (about 11 megabytes of data), I decided, this was a perfectly reasonable trade-off. The embeddings load quickly, and the similarity search is fast enough for interactive use. It's not production-ready for millions of documents, but it proves that Kedro's built-in tools can handle non-trivial workloads well enough. +* ✅ No external dependencies +* ✅ Fast retrieval (everything in memory) +* ✅ No type conversion needed +* ❌ Not scalable for very large datasets +* ❌ Entire dataset loads into memory + +For 13,000 pages (about 11 megabytes of data), I decided this was a perfectly reasonable trade-off. The embeddings load quickly, and the similarity search is fast enough for interactive use. It's not production-ready for millions of documents, but it proves that Kedro's built-in tools can handle non-trivial workloads well enough. -### The Results +### The results Integrating wiki embeddings into the context retrieval node significantly improved the quality of the output. The system could now answer questions about: -- Characters and locations -- Narrative events and mission outcomes -- Game mechanics like weapons, systems, skills, or vehicles -- World-building and lore + +* Characters and locations +* Narrative events and mission outcomes +* Game mechanics like weapons, systems, skills, or vehicles +* World-building and lore The prompt engineering also became more effective. 
With better data, I could confidently instruct the LLM to use only the provided context: @@ -243,7 +249,7 @@ The prompt engineering also became more effective. With better data, I could con Previously, this strict instruction led to many "I don't have sufficient data" responses. Now, with comprehensive wiki data, the LLM had enough context to provide accurate, detailed answers while staying within the provided materials. -## Making It a Conversation +## Making it a conversation At this point, I had a working RAG system, but using it was clunky. Each query required: @@ -251,9 +257,9 @@ At this point, I had a working RAG system, but using it was clunky. Each query r 2. Running the entire pipeline 3. Passing the query as a runtime parameter: `kedro run --pipeline=query_pipeline --params=user_query="Who is Johnny Silverhand?"` -This was fine for testing, but not very nice for actual use. I wanted something more convenient and user-friendly. +This was fine for testing but not very nice for actual use. I wanted something more convenient and user-friendly. -### The Loop Solution +### The loop solution Kedro nodes are just Python functions. There's nothing stopping you from putting a loop inside a node: @@ -287,17 +293,17 @@ def query_llm_cli( This approach leverages LangChain's `ChatPromptTemplate` (loaded via our `LangChainPromptDataset`) to maintain conversation history. The chatbot now has memory of previous exchanges, making the interaction feel natural and conversational. -## Games Belong on Discord +## Games belong on Discord As a stretch goal, I wanted to make this a Discord bot. It seemed fitting. A gaming knowledge base should live where people game. It also brought some interesting insights from an architecture perspective. -### The Async Challenge +### The async challenge -To get my Kedro runs to interact with Discord, I used Discord.py, an open source Python API wrapper for Discord. +To get my Kedro runs to interact with Discord, I used Discord.py, an open-source Python API wrapper for Discord. Discord.py is built on asyncio. Kedro pipeline runs are blocking operations. These two paradigms don't play well together. -**Solution: Bootstrap and Thread** +**Solution: bootstrap and thread** Each Discord command bootstraps its own Kedro session and runs the pipeline in a separate thread: @@ -329,7 +335,7 @@ async def run_query(ctx: commands.Context, *, user_query: str) -> None: This approach has a nice side effect: each query runs in its own pipeline execution, so multiple users can query the bot simultaneously without interfering with each other. -### The Message Length Problem +### The message length problem Discord has a 2000-character limit per message. LLM responses can easily exceed this. The solution was a simple chunking function: @@ -346,27 +352,28 @@ async def send_long_message(ctx: commands.Context, message: str) -> None: await ctx.send(message[i:i+DISCORD_MAX_MESSAGE_LENGTH]) ``` -### The Pipeline Architecture Challenge +### The pipeline architecture challenge Here's where I got some interesting insights about Kedro pipeline design. The Discord bot and CLI chatbot needed different behaviors: -- **CLI**: Interactive loop, maintains conversation history -- **Discord**: Single query, no history, returns string response +* **CLI**: Interactive loop, maintains conversation history +* **Discord**: Single query, no history, returns string response -**Attempt 1: Duplicate Pipelines** +**Attempt 1: duplicate pipelines** My first instinct was to create separate pipelines. 
This worked but violated DRY principles and Kedro best practices. Not ideal. In fact, Kedro does not even allow nodes to have the same name even if they're in different pipelines. The framework that exists to apply proper software development practices to data projects was doing its job. -**Attempt 2: Separate Pipelines with Tags** +**Attempt 2: separate pipelines with tags** I tried splitting into three pipelines: + 1. Context retrieval and prompt assembly (shared) 2. CLI LLM query node 3. Discord LLM query node The idea was to use tags to control execution. This failed because I couldn't guarantee execution order—sometimes the LLM node would run before context retrieval, leading to hallucinations from empty context. -**Solution: Single Pipeline with Tagged Nodes** +**Solution: single pipeline with tagged nodes** The chosen approach was a single pipeline with all nodes, using tags to select execution paths: @@ -401,52 +408,55 @@ def create_pipeline() -> Pipeline: ``` Now I can run: -- `kedro run --pipeline=query_pipeline --tags=cli` for CLI mode -- `kedro run --pipeline=query_pipeline --tags=discord` for Discord mode -Kedro's tag system ensures only the appropriate nodes run, and execution order is guaranteed because nodes are connected by their inputs/outputs. This is the Kedro way: use the framework's features to solve problems elegantly. +* `kedro run --pipeline=query_pipeline --tags=cli` for CLI mode +* `kedro run --pipeline=query_pipeline --tags=discord` for Discord mode + +Kedro's tag system ensures only the appropriate nodes run, and execution order is guaranteed because nodes are connected by their inputs and outputs. This is the Kedro way: use the framework's features to solve problems elegantly. -## Lessons Learned +## Lessons learned This project started with building a simple query system to test a new dataset. It ended up being a very insightful learning experience. -### 1. Separation of Concerns Is Important +### 1. Separation of concerns is important Having separate pipelines for processing and querying meant I could iterate on query logic without reprocessing 400 pages of transcript and 13,000 wiki pages. This separation, enforced by Kedro's architecture, was a great time saver. -### 2. The Data Catalog Is Your Friend +### 2. The data catalog is your friend Never hardcode file paths. The catalog makes it trivial to: -- Change data locations -- Use different datasets for testing vs. production -- Track data lineage -### 3. Parameters Enable Experimentation +* Change data locations +* Use different datasets for testing vs. production +* Track data lineage + +### 3. Parameters enable experimentation Moving magic numbers to `parameters.yml` made it easy to experiment: -- What's the optimal chunk size? -- How much overlap is needed? -- What temperature works best for this use case? + +* What's the optimal chunk size? +* How much overlap is needed? +* What temperature works best for this use case? Change a parameter, rerun the pipeline. No code changes needed. -### 4. Nodes Are Just Functions +### 4. Nodes are just functions Don't overthink it. A node can be a simple transformation, a complex loop, or anything in between. Kedro provides structure, not restrictions. -### 5. Tags Are Powerful +### 5. Tags are powerful The tag system solved a real architectural problem elegantly. It's a simple feature with powerful implications for pipeline organization. 
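To see why tag selection can't scramble execution order, note that filtering by tag only removes nodes from the DAG; whatever remains still runs in dependency order. A toy illustration, with placeholder functions rather than the project's own nodes:

```python
from kedro.pipeline import Pipeline, node


def retrieve_context(user_query: str) -> str:
    """Placeholder for the shared context-retrieval step."""
    return f"context for: {user_query}"


def answer_cli(context: str) -> str:
    """Placeholder for the CLI-only LLM step."""
    return f"answer based on: {context}"


query_pipeline = Pipeline([
    node(retrieve_context, inputs="user_query", outputs="context", tags=["cli", "discord"]),
    node(answer_cli, inputs="context", outputs="response_cli", tags=["cli"]),
])

# Filtering keeps only the tagged nodes; the inputs/outputs wiring still forces
# context retrieval to run before the answer node, so order is guaranteed.
cli_only = query_pipeline.only_nodes_with_tags("cli")
print(len(cli_only.nodes))  # 2, because both nodes carry the "cli" tag
```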
-## Conclusion: From Experiment to Production Pattern +## Conclusion: from experiment to production pattern What started as a test of an experimental dataset became a comprehensive exploration of building production-ready data pipelines with Kedro. The project demonstrates: -- **RAG system architecture** using semantic search -- **Custom dataset integration** with LangChain -- **Pipeline organization** for different execution modes -- **Async integration** with blocking operations -- **Practical constraints** and trade-offs in real systems +* **RAG system architecture** using semantic search +* **Custom dataset integration** with LangChain +* **Pipeline organization** for different execution modes +* **Async integration** with blocking operations +* **Practical constraints** and trade-offs in real systems The code is clean, maintainable, and follows Kedro best practices. More importantly, it works. The bot can answer questions about Cyberpunk 2077, drawing from both the game transcript and comprehensive wiki data. @@ -454,14 +464,16 @@ And after 466 hours of gameplay and every achievement unlocked, I can confirm: t ## Resources -- [Kedro Documentation](https://docs.kedro.org/) - Comprehensive guide to Kedro -- [LangChain Documentation](https://python.langchain.com/) - LLM framework -- [Discord.py Documentation](https://discordpy.readthedocs.io/) - Discord bot library -- [Sentence Transformers](https://www.sbert.net/) - Semantic embeddings -- [OpenAI API](https://platform.openai.com/docs) - LLM provider -- [Cyberpunk Wiki](https://cyberpunk.fandom.com/wiki/Cyberpunk_Wiki) - Game information source -- [Full Cyberpunk 2077 Transcript](https://game-scripts-wiki.blogspot.com/2020/12/cyberpunk-2077-full-transcript.html) - Transcript source +* [Kedro documentation](https://docs.kedro.org/) - Comprehensive guide to Kedro +* [LangChain documentation](https://python.langchain.com/) - LLM framework +* [Discord.py documentation](https://discordpy.readthedocs.io/) - Discord bot library +* [Sentence Transformers](https://www.sbert.net/) - Semantic embeddings +* [OpenAI API](https://platform.openai.com/docs) - LLM provider +* [Cyberpunk Wiki](https://cyberpunk.fandom.com/wiki/Cyberpunk_Wiki) - Game information source +* [Full Cyberpunk 2077 transcript](https://game-scripts-wiki.blogspot.com/2020/12/cyberpunk-2077-full-transcript.html) - Transcript source --- *The complete project is available on GitHub as part of the Kedro Academy repository. Feel free to explore, experiment, and adapt it for your own use cases.* + + From 2dade353709ab301cd67bab832497f06fd7e670a Mon Sep 17 00:00:00 2001 From: Jo Stichbury Date: Thu, 27 Nov 2025 14:07:46 +0000 Subject: [PATCH 3/8] Update KedroCyberpunkBlogPost.md --- .../post/KedroCyberpunkBlogPost.md | 30 +++++++------------ 1 file changed, 11 insertions(+), 19 deletions(-) diff --git a/kedro-cyberpunk-knowledge-base/post/KedroCyberpunkBlogPost.md b/kedro-cyberpunk-knowledge-base/post/KedroCyberpunkBlogPost.md index da15a636..0d362165 100644 --- a/kedro-cyberpunk-knowledge-base/post/KedroCyberpunkBlogPost.md +++ b/kedro-cyberpunk-knowledge-base/post/KedroCyberpunkBlogPost.md @@ -1,18 +1,20 @@ # Building a cyberpunk 2077 knowledge base with Kedro and LangChain -There's a well-known adage about writing that tells people to "write what they know." When I had to create a project to test an experimental Kedro dataset for loading LangChain prompt templates, I decided to take that advice to heart. 
+There's a well-known adage about writing that tells people to "write what they know." When I decided to create a project to test an experimental [Kedro dataset](https://docs.kedro.org/projects/kedro-datasets/) for loading [LangChain](https://python.langchain.com/) prompt templates, I took that advice to heart.

I embarked upon the nerdy endeavor of building an LLM-powered question-answering knowledge base whose sole purpose is to accurately answer questions about the action role-playing game Cyberpunk 2077. This would be the perfect test subject: I have over 450 hours of gameplay, every achievement unlocked, and more than a few passionate discussions (or heated arguments) on Reddit about the game under my belt. I could easily spot inaccurate responses, hallucinations, or any other LLM quirks that might slip through.

To my pleasant surprise, this project would evolve to become a valuable learning experience in building data pipelines with Kedro, wrestling with LLM limitations, and discovering that sometimes the best solutions come from working within constraints rather than around them.

+I decided to write up what I learned to share with the Kedro community. Hopefully you'll find this useful if you're familiar with Kedro and want to learn how to integrate LLMs into your projects, or if you want to learn how to expose your pipelines through external APIs and messaging platforms. If you want to try it out, you can find the project [on GitHub as part of the Kedro Academy repository](https://github.com/kedro-org/kedro-academy/tree/main/kedro-cyberpunk-knowledge-base).
+
 ## The project
 
At its core, this is a Retrieval-Augmented Generation (RAG) system built with Kedro. The initial goal was to take a full [transcript of a Cyberpunk 2077 playthrough](https://game-scripts-wiki.blogspot.com/2020/12/cyberpunk-2077-full-transcript.html) (over 400 pages of dialog), make it searchable, and use it to answer questions accurately. I could only select a specific playthrough from the blog that transcribes games, and this was a challenge seeing as the game itself has multiple different endings based on the player's choices throughout the story.

## The transcript

The `LangChainPromptDataset` was built to seamlessly integrate LangChain `PromptTemplate` objects into Kedro pipelines, enabling prompts to be loaded as raw data files and reducing boilerplate code. For a proper field test, I wanted to use it with a real LLM query workflow, not just unit tests or mock responses.
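As a rough sketch of what that integration looks like inside a node (the dataset name, node signature, and template variables here are assumptions for illustration; the dataset itself is experimental, so details may differ):

```python
from langchain_core.prompts import ChatPromptTemplate


def build_messages(qa_prompt: ChatPromptTemplate, user_query: str, context: str) -> list:
    """Node: Kedro loads `qa_prompt` from the catalog and injects it like any other dataset."""
    # format_messages is standard LangChain API for chat-style prompt templates.
    return qa_prompt.format_messages(context=context, question=user_query)
```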
+The `LangChainPromptDataset` was built to seamlessly integrate LangChain `PromptTemplate` objects into Kedro pipelines, enabling prompts to be loaded as raw data files and reducing boilerplate code. For a proper field test, I wanted to use it with a real LLM query workflow, not just unit tests or mock responses. ### Let's talk about chunking... @@ -116,7 +118,7 @@ I used this character list to boost the relevance score of transcript chunks whe **Solution 3: Semantic similarity with Sentence Transformers** -The most notable improvement came when I replaced keyword matching with semantic similarity search using Sentence Transformers: +The most notable improvement came when I replaced keyword matching with semantic similarity search using [Sentence Transformers](https://www.sbert.net/): ```python def find_relevant_contexts( @@ -150,7 +152,7 @@ However, even with these improvements, the transcript alone was insufficient. Qu ## The wiki -I needed a source of data that was complete, up-to-date, reliable, and neutral because people get *very* passionate about their in-game choices. The community-maintained Cyberpunk Wiki was the obvious answer. +I needed a source of data that was complete, up-to-date, reliable, and neutral because people get *very* passionate about their in-game choices. The [community-maintained Cyberpunk Wiki](https://cyberpunk.fandom.com/wiki/Cyberpunk_Wiki) was the obvious answer. ### The download challenge @@ -291,7 +293,7 @@ def query_llm_cli( conversation_history.append({"role": "ai", "content": response.content}) ``` -This approach leverages LangChain's `ChatPromptTemplate` (loaded via our `LangChainPromptDataset`) to maintain conversation history. The chatbot now has memory of previous exchanges, making the interaction feel natural and conversational. +This approach uses LangChain's `ChatPromptTemplate` (loaded via our `LangChainPromptDataset`) to maintain conversation history. The chatbot now has memory of previous exchanges, making the interaction feel natural and conversational. ## Games belong on Discord @@ -299,7 +301,7 @@ As a stretch goal, I wanted to make this a Discord bot. It seemed fitting. A gam ### The async challenge -To get my Kedro runs to interact with Discord, I used Discord.py, an open-source Python API wrapper for Discord. +To get my Kedro runs to interact with Discord, I used [Discord.py](https://discordpy.readthedocs.io/), an open-source Python API wrapper for Discord. Discord.py is built on asyncio. Kedro pipeline runs are blocking operations. These two paradigms don't play well together. @@ -462,17 +464,7 @@ The code is clean, maintainable, and follows Kedro best practices. More importan And after 466 hours of gameplay and every achievement unlocked, I can confirm: the bot's answers are accurate. Now if only it could tell me when the sequel's release date is going to be. 
-## Resources - -* [Kedro documentation](https://docs.kedro.org/) - Comprehensive guide to Kedro -* [LangChain documentation](https://python.langchain.com/) - LLM framework -* [Discord.py documentation](https://discordpy.readthedocs.io/) - Discord bot library -* [Sentence Transformers](https://www.sbert.net/) - Semantic embeddings -* [OpenAI API](https://platform.openai.com/docs) - LLM provider -* [Cyberpunk Wiki](https://cyberpunk.fandom.com/wiki/Cyberpunk_Wiki) - Game information source -* [Full Cyberpunk 2077 transcript](https://game-scripts-wiki.blogspot.com/2020/12/cyberpunk-2077-full-transcript.html) - Transcript source - ---- +### Get the code *The complete project is available on GitHub as part of the Kedro Academy repository. Feel free to explore, experiment, and adapt it for your own use cases.* From 1eebbe143d54f148da16110d217c76e2c6c02d33 Mon Sep 17 00:00:00 2001 From: "L. R. Couto" <57910428+lrcouto@users.noreply.github.com> Date: Fri, 28 Nov 2025 03:23:00 +0000 Subject: [PATCH 4/8] Add relevant skills to introduction Updated the blog post to improve clarity and fix minor errors. --- .../post/KedroCyberpunkBlogPost.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/kedro-cyberpunk-knowledge-base/post/KedroCyberpunkBlogPost.md b/kedro-cyberpunk-knowledge-base/post/KedroCyberpunkBlogPost.md index 0d362165..f523fb33 100644 --- a/kedro-cyberpunk-knowledge-base/post/KedroCyberpunkBlogPost.md +++ b/kedro-cyberpunk-knowledge-base/post/KedroCyberpunkBlogPost.md @@ -1,4 +1,4 @@ -# Building a cyberpunk 2077 knowledge base with Kedro and LangChain +# Building a Cyberpunk 2077 knowledge base with Kedro and LangChain There's a well-known adage about writing that tells people to "write what they know." When I decided to create a project to test an experimental [Kedro dataset](https://docs.kedro.org/projects/kedro-datasets/) for loading [LangChain](https://python.langchain.com/) prompt templates, I took that advice to heart. @@ -6,7 +6,7 @@ I embarked upon the nerdy endeavor of building an LLM-powered question-answering To my pleasant surprise, this project would evolve to become a valuable learning experience in building data pipelines with Kedro, wrestling with LLM limitations, and discovering that sometimes the best solutions come from working within constraints rather than around them. -I decided to write up what I learned to share with the Kedro community. Hopefully you'll find this useful if you have and are trying to achieve . If you want to try it out, you can find the the project [on GitHub as part of the Kedro Academy repository](https://github.com/kedro-org/kedro-academy/tree/main/kedro-cyberpunk-knowledge-base). +I decided to write up what I learned to share with the Kedro community. Hopefully you'll find this useful if you're familiar with Kedro and want to learn how to integrate LLMs into your projects, or if you want to learn how to expose your pipelines through external APIs and messaging platforms. If you want to try it out, you can find the the project [on GitHub as part of the Kedro Academy repository](https://github.com/kedro-org/kedro-academy/tree/main/kedro-cyberpunk-knowledge-base). ## The project @@ -166,7 +166,7 @@ The wiki is substantial, as it's about 15,000 pages. Downloading it required: { "Johnny Silverhand": "Johnny Silverhand is a...", "Arasaka Tower": "Arasaka Tower is located in...", - ... + "..." 
} ``` @@ -466,6 +466,6 @@ And after 466 hours of gameplay and every achievement unlocked, I can confirm: t ### Get the code -*The complete project is available on GitHub as part of the Kedro Academy repository. Feel free to explore, experiment, and adapt it for your own use cases.* +*The [complete project](https://github.com/kedro-org/kedro-academy/tree/main/kedro-cyberpunk-knowledge-base) is available on GitHub as part of the Kedro Academy repository. Feel free to explore, experiment, and adapt it for your own use cases.* From 3beea6f7c5e1b870a45e78717137a37ce15da417 Mon Sep 17 00:00:00 2001 From: "L. R. Couto" <57910428+lrcouto@users.noreply.github.com> Date: Fri, 28 Nov 2025 04:07:07 +0000 Subject: [PATCH 5/8] Add What is Kedro and Learn more about Kedro sections --- .../post/KedroCyberpunkBlogPost.md | 20 +++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/kedro-cyberpunk-knowledge-base/post/KedroCyberpunkBlogPost.md b/kedro-cyberpunk-knowledge-base/post/KedroCyberpunkBlogPost.md index f523fb33..eda4241e 100644 --- a/kedro-cyberpunk-knowledge-base/post/KedroCyberpunkBlogPost.md +++ b/kedro-cyberpunk-knowledge-base/post/KedroCyberpunkBlogPost.md @@ -12,6 +12,17 @@ I decided to write up what I learned to share with the Kedro community. Hopefull At its core, this is a Retrieval-Augmented Generation (RAG) system built with Kedro. The initial goal was to take a full [transcript of a Cyberpunk 2077 playthrough](https://game-scripts-wiki.blogspot.com/2020/12/cyberpunk-2077-full-transcript.html) (over 400 pages of dialog), make it searchable, and use it to answer questions accurately. I could only select a specific playthrough on a blog that transcribes games (linked below), and this was a challenge seeing as the game itself has multiple different endings based on the player's choices throughout the story. +## What is Kedro? + +[Kedro](https://kedro.org/) is an open-source Python framework for building production-ready data pipelines. It provides structure and best practices for data engineering projects through: + +* **Nodes**: Pure Python functions that transform data +* **Pipelines**: Directed acyclic graphs (DAGs) of nodes that define execution order +* **Data Catalog**: Centralized configuration for data sources, eliminating hardcoded file paths +* **Parameters**: Externalized configuration for easy experimentation + +For this project, Kedro provided the structure to manage all of the datasets and transformations, and allowed quick, easy iteration while building data pipelines. + ## The transcript The `LangChainPromptDataset` was built to seamlessly integrate LangChain `PromptTemplate` objects into Kedro pipelines, enabling prompts to be loaded as raw data files and reducing boilerplate code. For a proper field test, I wanted to use it with a real LLM query workflow, not just unit tests or mock responses. @@ -464,6 +475,15 @@ The code is clean, maintainable, and follows Kedro best practices. More importan And after 466 hours of gameplay and every achievement unlocked, I can confirm: the bot's answers are accurate. Now if only it could tell me when the sequel's release date is going to be. +## Learn more about Kedro + +If you want to dive deeper into Kedro or explore similar concepts: + +* **[Kedro Documentation](https://docs.kedro.org/):** The official docs for Kedro, covering everything from tutorials to advanced usage. +* **[Kedro GitHub Repository](https://github.com/kedro-org/kedro):** Source code, issues, and discussions. 
+* **[Kedro YouTube Channel](https://www.youtube.com/@kedro-python):** Guides, presentations, and bi-weekly livestreamed coffee chats with the Kedro team.
+* **[Kedro Slack Community](https://slack.kedro.org/):** Talk to other Kedro users and maintainers.
+
 ### Get the code
 
 *The [complete project](https://github.com/kedro-org/kedro-academy/tree/main/kedro-cyberpunk-knowledge-base) is available on GitHub as part of the Kedro Academy repository. Feel free to explore, experiment, and adapt it for your own use cases.*

From c47bf543b56f0ede25281147c5ca70cb39670b74 Mon Sep 17 00:00:00 2001
From: Jo Stichbury
Date: Wed, 10 Dec 2025 12:22:40 +0000
Subject: [PATCH 6/8] Create medium-copy.md

---
 .../post/medium-copy.md | 366 ++++++++++++++++++
 1 file changed, 366 insertions(+)
 create mode 100644 kedro-cyberpunk-knowledge-base/post/medium-copy.md

diff --git a/kedro-cyberpunk-knowledge-base/post/medium-copy.md b/kedro-cyberpunk-knowledge-base/post/medium-copy.md
new file mode 100644
index 00000000..80ed9701
--- /dev/null
+++ b/kedro-cyberpunk-knowledge-base/post/medium-copy.md
@@ -0,0 +1,366 @@
+---
+
+Building a Cyberpunk 2077 knowledge base with Kedro and LangChain
+Combining the power of LLMs with the structure of a Kedro workflow
+This article comes from Laura Couto, a software engineer within the Kedro team at QuantumBlack Labs. You'll learn how a Cyberpunk-themed experiment evolved into a fully fledged Retrieval-Augmented Generation system built with Kedro and LangChain, revealing practical patterns for structuring Kedro pipelines, enhancing retrieval quality, and integrating with Discord.
+
+---
+
+There's a well-known adage about writing that tells people to "write what they know." When I decided to create a project to test an experimental Kedro dataset for loading LangChain prompt templates, I took that advice to heart.
+I embarked upon the nerdy endeavor of building an LLM-powered question-answering knowledge base whose sole purpose is to accurately answer questions about the action role-playing game Cyberpunk 2077. This would be the perfect test subject: I have over 450 hours of gameplay, every achievement unlocked, and more than a few passionate discussions (or heated arguments) about the game on Reddit under my belt. I could easily spot inaccurate responses, hallucinations, or any other LLM quirks that might slip through.
+Cyberpunk 2077 box art, by Artur Tarnowski, 3D Modeler & Character Artist for CD Projekt (fair use reproduction). The copyright for this image is held by CD PROJEKT RED.
+To my pleasant surprise, this project would evolve to become a valuable learning experience in building data pipelines with Kedro, wrestling with LLM limitations, and discovering that sometimes the best solutions come from working within constraints rather than around them.
+I decided to write up what I learned to share with the Kedro community. Hopefully you'll find this useful if you're familiar with Kedro and want to learn how to integrate LLMs into your projects, or if you want to learn how to expose your pipelines through external APIs and messaging platforms. If you want to try it out, you can find the project on GitHub as part of the Kedro Academy repository. You can also find the video walkthrough of this project in a recent Kedro coffee chat:
+
+---
+
+The project
+At its core, this is a Retrieval-Augmented Generation (RAG) system built with Kedro.
The initial goal was to take a full transcript of a Cyberpunk 2077 playthrough (over 400 pages of dialog), make it searchable, and use it to answer questions accurately.  +This article walks through the workflow shown.What is Kedro? +Kedro is an open-source Python framework for building production-ready data pipelines. It provides structure and best practices for data engineering projects through: +Nodes: Pure Python functions that transform data +Pipelines: Directed acyclic graphs (DAGs) of nodes that define execution order +Data Catalog: Centralized configuration for data sources, eliminating hardcoded file paths +Parameters: Externalized configuration for easy experimentation + +For this project, Kedro provided the structure to manage all of the datasets and transformations, and allowed quick, easy iteration while building data pipelines. +The transcript +The LangChainPromptDataset was built to seamlessly integrate LangChain PromptTemplate objects into Kedro pipelines, enabling prompts to be loaded as raw data files and reducing boilerplate code. For a proper field test, I wanted to use it with a real LLM query workflow, not just unit tests or mock responses. +Let's talk about chunking… +When I started, the first issue I encountered was that LLMs have token limits. You can't just dump 400 pages of transcript into a prompt and expect it to work. The transcript needed to be broken down into manageable chunks. +I started by creating a process_transcript Kedro pipeline that handles this transformation: +def chunk_transcript( + transcript: str, chunk_size: int = 1000, overlap: int = 200 +) -> List[Dict[str, Any]]: + """Split transcript into overlapping chunks for context retrieval.""" + cleaned_transcript = re.sub(r"\n+", "\n", transcript.strip()) + sentences = re.split(r"(?<=[.!?])\s+", cleaned_transcript) + + chunks = [] + start_idx = 0 + + while start_idx < len(sentences): + end_idx = min(start_idx + chunk_size, len(sentences)) + chunk_text = " ".join(sentences[start_idx:end_idx]) + + chunks.append({ + "text": chunk_text, + "chunk_id": len(chunks), + "start_sentence": start_idx, + "end_sentence": end_idx - 1, + }) + + # Move forward with overlap to preserve context + start_idx = max(start_idx + chunk_size - overlap, start_idx + 1) + + return chunks +The overlap is crucial as it ensures that context spanning chunk boundaries isn't lost. This is a common pattern in RAG systems, but implementing it as a Kedro node made it easy to experiment with different chunk sizes and overlap values through the pipeline parameters feature. +Then I built the project into two separate pipelines: +process_transcript: Processes raw data once (expensive operation) +query_pipeline: Runs queries repeatedly (cheap operation) + +Kedro's pipeline structure made iteration easier. This separation meant I could process the data once, store it as a Kedro dataset, and then query it as many times as I wanted without reprocessing. The data catalog handles all the file I/O, so storage and loading are straightforward. +Chippin' in: the challenges after chunking +I could only select a specific playthrough, and this was a challenge seeing as the game itself has multiple different endings based on the player's choices throughout the story. The transcript had four fundamental limitations: +Dialog-only content: The 400-page transcript contained only dialog, missing crucial narrative context about what's happening in the scene. 
+Single playthrough bias: As this transcript was from only one specific playthrough, it didn't have information about alternative paths or endings. +The hallucination problem: When I instructed the LLM to strictly use only the provided context with a low temperature, it would often respond with "I don't have sufficient information." But if I allowed it more freedom or increased the temperature, it would confidently spout misinformation about the game. +Naive keyword matching: My initial approach of using simple keyword matching to find relevant chunks was inadequate. Sometimes completely unrelated chunks would be selected, leading to nonsensical responses about other plot points or characters. What a gonk. + +Preem solutions +Kedro's node-based architecture made it trivial to experiment with different approaches. Each solution became a new node or a modification to an existing one. +Solution 1: PartitionedDataset for better retrieval +I initially stored chunks as individual JSON files. Switching to Kedro's PartitionedDataset made retrieval more efficient and the code cleaner: +def partition_transcript_chunks( + chunks: List[Dict[str, Any]], +) -> Dict[str, Dict[str, Any]]: + """Convert chunks into partition mapping for Kedro's PartitionedDataset.""" + partitions: Dict[str, Dict[str, Any]] = {} + for chunk in chunks: + chunk_id = chunk.get("chunk_id", len(partitions)) + partition_key = f"chunk_{chunk_id}" + partitions[partition_key] = chunk + return partitions +This change alone improved response quality because the partitioned structure made it easier to search and retrieve specific chunks. +Solution 2: Character name extraction +The transcript's "Character: Dialogue" structure made it easy to extract character names: +def extract_characters(transcript: str) -> List[str]: + """Extract unique character names from transcript (format: "Name: Dialogue").""" + character_pattern = r"^([A-Za-z\s]+):" + characters = set() + + for line in transcript.split("\n"): + match = re.match(character_pattern, line.strip()) + if match: + character_name = match.group(1).strip() + if character_name and len(character_name) > 1: + characters.add(character_name) + + return sorted(list(characters)) +I used this character list to boost the relevance score of transcript chunks when a query mentioned a character name. This simple heuristic significantly improved results for character-related questions. +Solution 3: Semantic similarity with Sentence Transformers +The most notable improvement came when I replaced keyword matching with semantic similarity search using Sentence Transformers: +def find_relevant_contexts( + query: str, + transcript_chunks: Dict[str, Any], + wiki_embeddings: Dict[str, Dict[str, Any]], + character_list: List[str], + embedding_model_name: str = "all-MiniLM-L6-v2", + max_chunks: int = 5, + character_bonus: float = 0.05, + wiki_weight: float = 0.7, +) -> List[Dict[str, Any]]: + """Retrieve top relevant contexts using semantic similarity.""" + model = get_embedding_model(embedding_model_name) + query_emb = model.encode(query, convert_to_tensor=True) + + results = [] + # Calculate similarity for transcript chunks and wiki pages... + # Apply character bonus and wiki weight... + + results.sort(key=lambda x: x[0], reverse=True) + return [ + {"source": src, "text": txt, "similarity": sim} + for sim, src, txt in results[:max_chunks] + ] +This change made response quality significantly better. The model could now understand that "Who is Johnny Silverhand?" 
and "Tell me about the guy who blew up Arasaka Tower?" were asking for similar information, even without exact keyword matches.
+However, even with these improvements, the transcript alone was insufficient. Questions about characters worked reasonably well, but questions about game mechanics, missions, or world-building fell flat. The data simply wasn't there. I needed a better data source.
+The wiki
+I needed a source of data that was complete, up-to-date, reliable, and neutral because people get very passionate about their in-game choices. The community-maintained Cyberpunk Wiki was the obvious answer.
+The download challenge
+The wiki is substantial, as it's about 15,000 pages. Downloading it required the following:
+Respecting API rate limits: I had to add intentional pauses between requests to avoid getting blocked. The download script took a couple of hours to complete.
+Choosing the right format: I settled on a single JSON file where each page is a key-value pair:
+
+{
+  "Johnny Silverhand": "Johnny Silverhand is a...",
+  "Arasaka Tower": "Arasaka Tower is located in...",
+  "..."
+}
+This format made it easy to process in Kedro and convert to embeddings.
+Data cleanup: A significant portion of the work was cleaning the data:
+Removing redirect pages
+Stripping Markdown syntax
+Removing image tags, external links, and language links
+Cleaning up formatting artifacts
+
+I wrote a separate Python script for this cleanup, which was a one-time operation. The cleaned data went into the data/raw/ directory, ready for Kedro to process.
+Kedro-specific challenges
+This project was already testing one new dataset (LangChainPromptDataset), and I wanted to see how far I could push Kedro's built-in datasets before needing external tools.
+The "proper" solution for storing embeddings would be a vector database like Pinecone or Weaviate. But that would require:
+Writing another custom dataset to interface with the database
+Setting up and managing external infrastructure
+Adding complexity to the project
+
+I decided to test Kedro's limits instead. Could I build a functional RAG system using only Kedro's built-in datasets?
+I stored the cleaned wiki data (about 13,000 entries) in a JSONDataset, then added a node to the process_transcript pipeline to generate embeddings:
+def embed_wiki_pages(
+    wiki_data: Dict[str, str],
+    embedding_model_name: str = "all-MiniLM-L6-v2"
+) -> Dict[str, Dict[str, Any]]:
+    """Generate embeddings for wiki pages using SentenceTransformer."""
+    model = get_embedding_model(embedding_model_name)
+    embedded_pages: Dict[str, Dict[str, Any]] = {}
+
+    for title, text in tqdm(wiki_data.items()):
+        if not text.strip():
+            continue
+        embedding = model.encode(text, convert_to_numpy=True)
+        embedded_pages[title] = {"text": text, "embedding": embedding}
+
+    return embedded_pages
+Then I stored these embeddings in a PickleDataset:
+wiki_embeddings:
+  type: pickle.PickleDataset
+  filepath: data/processed/wiki_embeddings.pkl
+This approach has trade-offs:
+✅ No external dependencies
+✅ Fast retrieval (everything in memory)
+✅ No type conversion needed
+❌ Not scalable for very large datasets
+❌ Entire dataset loads into memory
+
+For 13,000 pages (about 11 megabytes of data), I decided this was a perfectly reasonable trade-off. The embeddings load quickly, and the similarity search is fast enough for interactive use. It's not production-ready for millions of documents, but it proves that Kedro's built-in tools can handle non-trivial workloads well enough.
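+To give a sense of what this looks like in practice, here's a minimal sketch of querying the pickled embeddings directly. It's an illustration rather than the project's actual retrieval node, and it assumes the {title: {"text": ..., "embedding": ...}} structure produced by embed_wiki_pages and the catalog filepath shown above:
+import pickle
+
+import numpy as np
+from sentence_transformers import SentenceTransformer
+
+model = SentenceTransformer("all-MiniLM-L6-v2")
+
+# Load the whole embeddings dict into memory, just as PickleDataset would
+with open("data/processed/wiki_embeddings.pkl", "rb") as f:
+    wiki_embeddings = pickle.load(f)
+
+query_emb = model.encode("Who is Johnny Silverhand?", convert_to_numpy=True)
+
+def cosine(a: np.ndarray, b: np.ndarray) -> float:
+    # Cosine similarity between two dense vectors
+    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
+
+# Brute-force scan over all ~13,000 pages; fast enough at this scale
+top = sorted(
+    ((cosine(query_emb, page["embedding"]), title)
+     for title, page in wiki_embeddings.items()),
+    reverse=True,
+)[:3]
+print(top)  # top-3 most similar wiki page titles
+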
+The results
+Integrating wiki embeddings into the context retrieval node significantly improved the quality of the output. The system could now answer questions about:
+Characters and locations
+Narrative events and mission outcomes
+Game mechanics like weapons, systems, skills, or vehicles
+World-building and lore
+
+The prompt engineering also became more effective. With better data, I could confidently instruct the LLM to use only the provided context:
+{
+  "role": "system",
+  "content": "You are an expert in Cyberpunk 2077... Using ONLY the information provided from in-game transcripts and wiki entries, answer the user's questions. Do NOT include any information that is not present in the provided context."
+}
+Previously, this strict instruction led to many "I don't have sufficient data" responses. Now, with comprehensive wiki data, the LLM had enough context to provide accurate, detailed answers while staying within the provided materials.
+Conversion of the data sources to Kedro datasets
+Retrieval following a user query
+Making it a conversation
+At this point, I had a working RAG system, but using it was clunky. Each query required:
+Starting a new Kedro session
+Running the entire pipeline
+Passing the query as a runtime parameter: kedro run --pipeline=query_pipeline --params=user_query="Who is Johnny Silverhand?"
+
+This was fine for testing but not very nice for actual use. I wanted something more convenient and user-friendly.
+The loop solution
+Kedro nodes are just Python functions. There's nothing stopping you from putting a loop inside a node:
+def query_llm_cli(
+    transcript_chunks: Dict[str, Any] = None,
+    wiki_embeddings: Dict[str, Dict[str, Any]] = None,
+    character_list: List[str] = None,
+    # ... other parameters ...
+) -> None:
+    """Interactive conversation loop for CLI chatbot."""
+    api_key = get_openai_api_key()
+    llm = get_llm(api_key=api_key, model=llm_model_name, temperature=llm_temperature)
+    conversation_history: List[Any] = []
+
+    while True:
+        user_query = input("🟢 You: ").strip()
+        if user_query.lower() in {"exit", "quit"}:
+            break
+
+        # Find relevant contexts and format prompt
+        contexts = find_relevant_contexts(query=user_query, ...)
+        new_messages = format_prompt_with_context(...)
+
+        conversation_history.extend(new_messages)
+        response = llm.invoke(conversation_history)
+
+        print(f"\n⚪ LLM: {response.content}\n")
+        conversation_history.append({"role": "ai", "content": response.content})
+This approach uses LangChain's ChatPromptTemplate (loaded via our LangChainPromptDataset) to maintain conversation history. The chatbot now has memory of previous exchanges, making the interaction feel natural and conversational.
+Prompt history is saved and reused as context in the CLI conversational chatbot.
+Games belong on Discord
+As a stretch goal, I wanted to make this a Discord bot. It seemed fitting. A gaming knowledge base should live where people game. It also brought some interesting insights from an architecture perspective.
+The async challenge
+To get my Kedro runs to interact with Discord, I used Discord.py, an open-source Python API wrapper for Discord.
+Discord.py is built on asyncio for asynchronous I/O. Kedro pipeline runs are blocking operations. These two paradigms don't play well together.
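+To make the clash concrete, here's a toy sketch (not code from the project) of the difference between calling blocking work directly inside an async handler and offloading it to a thread, which is the pattern the bot uses below:
+import asyncio
+import time
+
+async def blocking_handler():
+    # A synchronous pipeline run called here freezes the whole event loop;
+    # no other command, message, or heartbeat is serviced for the duration.
+    time.sleep(5)
+
+async def threaded_handler():
+    # asyncio.to_thread runs the blocking call in a worker thread,
+    # so the event loop stays responsive while the work executes.
+    await asyncio.to_thread(time.sleep, 5)
+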
+Solution: bootstrap and thread
+Each Discord command bootstraps its own Kedro session and runs the pipeline in a separate thread:
+def setup_kedro_project() -> Path:
+    """Bootstrap and configure Kedro project."""
+    project_path = Path(__file__).resolve().parent
+    metadata = bootstrap_project(project_path)
+    configure_project(metadata.package_name)
+    return project_path
+
+@bot.command(name="/query")
+async def run_query(ctx: commands.Context, *, user_query: str) -> None:
+    """Run Kedro query pipeline asynchronously."""
+    await ctx.send(f"🚀 Running Kedro pipeline for query: `{user_query}`...")
+
+    project_path = setup_kedro_project()
+
+    def run_kedro() -> Any:
+        with KedroSession.create(
+            project_path=project_path,
+            runtime_params={"user_query": user_query}
+        ) as session:
+            return session.run(pipeline_name="query_pipeline", tags=["discord"])
+
+    result = await asyncio.to_thread(run_kedro)
+    # ... process and send response ...
+This approach has a nice side effect: each query runs in its own pipeline execution, so multiple users can query the bot simultaneously without interfering with each other.
+The message length problem
+Discord has a 2000-character limit per message. LLM responses can easily exceed this. The solution was a simple chunking function:
+DISCORD_MAX_MESSAGE_LENGTH = 2000
+DISCORD_SAFE_MESSAGE_LENGTH = 1900
+
+async def send_long_message(ctx: commands.Context, message: str) -> None:
+    """Send message to Discord, chunking if it exceeds 2000 characters."""
+    if len(message) <= DISCORD_SAFE_MESSAGE_LENGTH:
+        await ctx.send(message)
+    else:
+        for i in range(0, len(message), DISCORD_MAX_MESSAGE_LENGTH):
+            await ctx.send(message[i:i+DISCORD_MAX_MESSAGE_LENGTH])
+The pipeline architecture challenge
+Here's where I got some interesting insights about Kedro pipeline design. The Discord bot and CLI chatbot needed different behaviors:
+CLI: Interactive loop, maintains conversation history
+Discord: Single query, no history, returns string response
+
+Attempt 1: duplicate pipelines
+My first instinct was to create separate pipelines. This worked but violated DRY principles and Kedro best practices. Not ideal. In fact, Kedro does not even allow nodes to have the same name even if they're in different pipelines. The framework that exists to apply proper software development practices to data projects was doing its job.
+Attempt 2: separate pipelines with tags
+I tried splitting it into three pipelines:
+Context retrieval and prompt assembly (shared)
+CLI LLM query node
+Discord LLM query node
+
+The idea was to use tags to control execution. This failed because I couldn't guarantee execution order - sometimes the LLM node would run before context retrieval, leading to hallucinations from the empty context.
+Solution: single pipeline with tagged nodes
+The chosen approach was a single pipeline with all nodes, using tags to select execution paths:
+def create_pipeline() -> Pipeline:
+    return Pipeline([
+        Node(
+            func=find_relevant_contexts,
+            inputs=[...],
+            outputs="relevant_contexts",
+            tags=["cli", "discord"],  # Shared node
+        ),
+        Node(
+            func=format_prompt_with_context,
+            inputs=[...],
+            outputs="formatted_prompt",
+            tags=["cli", "discord"],  # Shared node
+        ),
+        Node(
+            func=query_llm_cli,
+            inputs=[...],
+            outputs="llm_response_cli",
+            tags=["cli"],  # CLI-only
+        ),
+        Node(
+            func=query_llm_discord,
+            inputs=[...],
+            outputs="llm_response_discord",
+            tags=["discord"],  # Discord-only
+        ),
+    ])
+Now I can run:
+kedro run --pipeline=query_pipeline --tags=cli for CLI mode
+kedro run --pipeline=query_pipeline --tags=discord for Discord mode
+
+Kedro's tag system ensures only the appropriate nodes run, and execution order is guaranteed because nodes are connected by their inputs and outputs. This is the Kedro way: use the framework's features to solve problems elegantly.
+Tags are used for a single pipeline for both CLI and Discord outputs.
+Lessons learned
+This project started with building a simple query system to test a new dataset. It ended up being a very insightful learning experience.
+The working project running on Discord.
+1. Separation of concerns is important
+Having separate pipelines for processing and querying meant I could iterate on query logic without reprocessing 400 pages of transcript and 13,000 wiki pages. This separation, enforced by Kedro's architecture, was a great time saver.
+2. The data catalog is your friend
+Never hardcode file paths. The catalog makes it trivial to:
+Change data locations
+Use different datasets for testing vs. production
+Track data lineage
+
+3. Parameters enable experimentation
+Moving magic numbers to parameters.yml made it easy to experiment:
+What's the optimal chunk size?
+How much overlap is needed?
+What temperature works best for this use case?
+
+Change a parameter, rerun the pipeline. No code changes needed.
+4. Nodes are just functions
+Don't overthink it. A node can be a simple transformation, a complex loop, or anything in between. Kedro provides structure, not restrictions.
+5. Tags are powerful
+The tag system solved a real architectural problem elegantly. It's a simple feature with powerful implications for pipeline organization.
+Conclusion: from experiment to production pattern
+What started as a test of an experimental dataset became a comprehensive exploration of building production-ready data pipelines with Kedro, demonstrating:
+How Kedro's pipelines, nodes, catalog, parameters, and tags provide the backbone for a clean, scalable RAG architecture.
+Why thoughtful data handling (chunking, embeddings, semantic search, and wiki integration) dramatically improves retrieval accuracy.
+How challenges can surface when bridging LLM workflows with real-world constraints like token limits, hallucinations, and Discord's async environment.
+That a single, well-structured Kedro pipeline supports multiple execution modes (CLI and Discord) without duplication.
+
+The code is clean, maintainable, and follows Kedro best practices. More importantly, it works. The bot can answer questions about Cyberpunk 2077, drawing from both the game transcript and comprehensive wiki data.
+And after 466 hours of gameplay and every achievement unlocked, I can confirm: the bot's answers are accurate.
Now, if only it could tell me when the sequel's release date is going to be…🤔
+Learn more about Kedro
+Kedro is an open-source Python toolbox that applies software engineering principles to data engineering and data science code. It reduces the time spent rewriting data science experiments so that they are fit for production.
+Kedro was born at QuantumBlack to solve the challenges faced regularly in data science projects and promote teamwork through standardized team workflows. It is now hosted by the LF AI & Data Foundation as an incubating project.
+If you want to dive deeper into Kedro or explore similar concepts:
+Kedro Documentation: The official docs for Kedro, covering everything from tutorials to advanced usage.
+Kedro GitHub Repository: Source code, issues, and discussions.
+Kedro YouTube Channel: Guides, presentations, and bi-weekly livestreamed coffee chats with the Kedro team.
+Kedro Slack Community: Talk to other Kedro users and maintainers.
+
+Get the code
+The complete project is available on GitHub as part of the Kedro Academy repository. Feel free to explore, experiment, and adapt it for your own use cases.
+Feedback? Let us know in the comments below, or respond on GitHub.
+
+---
+
+Many thanks to Laura Couto for authoring, to Rashida Kanchwala and Joseph Perkins for review, and Jo Stichbury for editorial guidance.
\ No newline at end of file

From 589bbad86919aff39d289685877f908ce7eb4cbc Mon Sep 17 00:00:00 2001
From: "L. R. Couto" <57910428+lrcouto@users.noreply.github.com>
Date: Thu, 11 Dec 2025 00:36:43 -0300
Subject: [PATCH 7/8] Add changes suggested by technical stakeholder

---
 .../post/medium-copy.md | 39 ++++++++++++++-----
 1 file changed, 30 insertions(+), 9 deletions(-)

diff --git a/kedro-cyberpunk-knowledge-base/post/medium-copy.md b/kedro-cyberpunk-knowledge-base/post/medium-copy.md
index 80ed9701..b16de54e 100644
--- a/kedro-cyberpunk-knowledge-base/post/medium-copy.md
+++ b/kedro-cyberpunk-knowledge-base/post/medium-copy.md
@@ -15,14 +15,20 @@ I decided to write up what I learned to share with the Kedro community. Hopefull
 
 The project
 At its core, this is a Retrieval-Augmented Generation (RAG) system built with Kedro. The initial goal was to take a full transcript of a Cyberpunk 2077 playthrough (over 400 pages of dialog), make it searchable, and use it to answer questions accurately. 
-This article walks through the workflow shown.What is Kedro?
+
+The following diagram illustrates how the RAG pipeline operates end-to-end. The raw data is chunked and embeddings are generated from it, all of which are stored in Kedro datasets. The user query is then passed as a parameter to the pipeline, and semantic similarity search is used to find the most relevant part of the stored data. This is then used as context to format a LangChain ChatPromptTemplate object. Finally, the LLM processes this prompt to produce a response, which becomes the final product of this pipeline.
+
+This article walks through the workflow shown.
+
+What is Kedro?
 Kedro is an open-source Python framework for building production-ready data pipelines.
It provides structure and best practices for data engineering projects through: Nodes: Pure Python functions that transform data Pipelines: Directed acyclic graphs (DAGs) of nodes that define execution order Data Catalog: Centralized configuration for data sources, eliminating hardcoded file paths Parameters: Externalized configuration for easy experimentation -For this project, Kedro provided the structure to manage all of the datasets and transformations, and allowed quick, easy iteration while building data pipelines. +Kedro transforms what is often a loose collection of scripts and LangChain components into a coherent, well-defined and reproducible RAG architecture. For this project, it provided the structure to manage all of the datasets and transformations, allowing quick, easy iteration while building data pipelines. + The transcript The LangChainPromptDataset was built to seamlessly integrate LangChain PromptTemplate objects into Kedro pipelines, enabling prompts to be loaded as raw data files and reducing boilerplate code. For a proper field test, I wanted to use it with a real LLM query workflow, not just unit tests or mock responses. Let's talk about chunking… @@ -123,12 +129,16 @@ def find_relevant_contexts( for sim, src, txt in results[:max_chunks] ] This change made response quality significantly better. The model could now understand that "Who is Johnny Silverhand?" and "Tell me about the guy who blew up Arasaka Tower?" were asking for similar information, even without exact keyword matches. + +I've chosen to use SentenceTransformers to embed and retrieve information at the sentence level to try to get a balance between retrieving the part of the data that was most relevant and preserving meaning. Token-based or size-based chunking, like LangChain TextSplitter, risks slicing an idea arbitrarily, while long paragraphs can end up mixing unrelated concepts. Sentence level embeddings give me a chunk of context that's small enough to not contain unrelated ideas or that end up using too many tokens, while still preserving the meaning, which is particularly interesting in the case of a game like Cyberpunk 2077, that contains a lot of invented slang, pseudo-technological terms, and other fictional bits of language. + However, even with these improvements, the transcript alone was insufficient. Questions about characters worked reasonably well, but questions about game mechanics, missions, or world-building fell flat. The data simply wasn't there. I needed a better data source. The wiki I needed a source of data that was complete, up-to-date, reliable, and neutral because people get very passionate about their in-game choices. The community-maintained Cyberpunk Wiki was the obvious answer. The download challenge The wiki is substantial, as it's about 15,000 pages. Downloading it required the following: -Respecting API rate limits: I had to add intentional pauses between requests to avoid getting blocked. The download script took a couple of hours to complete. +Respecting API rate limits: I had to add intentional pauses between requests to avoid getting blocked. The download script, which ran simple GET requests to the fandom.com API with `sleep()` calls for pauses, took a couple of hours to complete. Getting fresh data from the wiki would require the script to be run again manually, but since Cyberpunk 2077 is a completed game that receives very infrequent updates, I decided this was not an issue. 
+
 Choosing the right format: I settled on a single JSON file where each page is a key-value pair:
 
 {
   "Johnny Silverhand": "Johnny Silverhand is a...",
@@ -143,7 +153,8 @@ Stripping Markdown syntax
 Removing image tags, external links, and language links
 Cleaning up formatting artifacts
 
-I wrote a separate Python script for this cleanup, which was a one-time operation. The cleaned data went into the data/raw/ directory, ready for Kedro to process.
+I wrote a separate Python script using string manipulation for this cleanup, which was a one-time operation. The cleaned data went into the data/raw/ directory, ready for Kedro to process. Those scripts are not included in the project's repository, as they're not directly related to running the RAG pipeline.
+
 Kedro-specific challenges
 This project was already testing one new dataset (LangChainPromptDataset), and I wanted to see how far I could push Kedro's built-in datasets before needing external tools.
 The "proper" solution for storing embeddings would be a vector database like Pinecone or Weaviate. But that would require:
@@ -179,7 +190,8 @@ This approach has trade-offs:
 ❌ Not scalable for very large datasets
 ❌ Entire dataset loads into memory
 
-For 13,000 pages (about 11 megabytes of data), I decided this was a perfectly reasonable trade-off. The embeddings load quickly, and the similarity search is fast enough for interactive use. It's not production-ready for millions of documents, but it proves that Kedro's built-in tools can handle non-trivial workloads well enough.
+For 13,000 pages (about 11 megabytes of data), I decided this was a perfectly reasonable trade-off. The embeddings load quickly, and the similarity search is fast enough for interactive use. It's easy to run locally, does not require any setup, and allowed me to iterate quickly on writing the pipeline. It would not be a suitable approach for production, as the pickled embeddings must be fully loaded into memory, and the PickleDataset does not offer the same range of features that a vector database would, like indexing, metadata filtering, or persistence in storage. But for a relatively small project, it demonstrates well enough that Kedro's built-in tools can handle non-trivial workloads.
+
 The results
 Integrating wiki embeddings into the context retrieval node significantly improved the quality of the output. The system could now answer questions about:
 Characters and locations
@@ -230,9 +242,10 @@ def query_llm_cli(
 This approach uses LangChain's ChatPromptTemplate (loaded via our LangChainPromptDataset) to maintain conversation history. The chatbot now has memory of previous exchanges, making the interaction feel natural and conversational.
 Prompt history is saved and reused as context in the CLI conversational chatbot.
 Games belong on Discord
 As a stretch goal, I wanted to make this a Discord bot. It seemed fitting. A gaming knowledge base should live where people game. It also brought some interesting insights from an architecture perspective.
+
 The async challenge
 To get my Kedro runs to interact with Discord, I used Discord.py, an open-source Python API wrapper for Discord.
-Discord.py is built on asyncio for asynchronous I/O. Kedro pipeline runs are blocking operations. These two paradigms don't play well together.
+Discord.py is built on asyncio for asynchronous I/O, and Discord itself is built on asynchronous operations, as is common for messaging applications. A Discord bot must constantly yield control to the Discord event loop so it can handle network I/O, message dispatching, etc.
Kedro pipeline runs, on the other hand, are blocking operations. They run synchronously inside a blocking Python call stack, occupying the main thread until completion. These two paradigms don't play well together.
 Solution: bootstrap and thread
 Each Discord command bootstraps its own Kedro session and runs the pipeline in a separate thread:
 def setup_kedro_project() -> Path:
@@ -276,7 +289,7 @@ CLI: Interactive loop, maintains conversation history
 Discord: Single query, no history, returns string response
 
 Attempt 1: duplicate pipelines
-My first instinct was to create separate pipelines. This worked but violated DRY principles and Kedro best practices. Not ideal. In fact, Kedro does not even allow nodes to have the same name even if they're in different pipelines. The framework that exists to apply proper software development practices to data projects was doing its job.
+My first instinct was to create separate pipelines. This worked but violated DRY principles and Kedro best practices. Not ideal. In fact, Kedro does not even allow nodes to have the same name even if they're in different pipelines. It enforces pipeline integrity, and duplicating nodes breaks maintainability. The framework that exists to apply proper software development practices to data projects was doing its job.
 Attempt 2: separate pipelines with tags
 I tried splitting it into three pipelines:
 Context retrieval and prompt assembly (shared)
@@ -346,7 +359,13 @@ Why thoughtful data handling (chunking, embeddings, semantic search, and wiki in
 How challenges can surface when bridging LLM workflows with real-world constraints like token limits, hallucinations, and Discord's async environment.
 That a single, well-structured Kedro pipeline supports multiple execution modes (CLI and Discord) without duplication.
 
-The code is clean, maintainable, and follows Kedro best practices. More importantly, it works. The bot can answer questions about Cyberpunk 2077, drawing from both the game transcript and comprehensive wiki data.
+It was a good opportunity to explore patterns that generalize well to production RAG systems, like separation of data ingestion, preprocessing, embedding and retrieval, so each of those stages can be worked on independently. And because chunks, embeddings, prompts, and other artifacts are stored as reproducible datasets, they're easy to swap and experiment with.
+
+Kedro brings structure to this exploration. It enforces correct ordering of transformations, offers a configuration layer that keeps things like paths, parameters, and credentials easy to track, and formalizes I/O so artifacts are typed and testable. Nodes as pure Python functions give flexibility in writing code, while modular pipelines make the project easier to refactor when necessary.
+
+It was also nice to see how a simple PickleDataset worked very well for a demo. Were the project to be scaled to a larger level, for example, if I wanted to add hundreds of thousands of social media posts about Cyberpunk 2077 to my data sources, it would benefit from moving to a vector database. After a certain point, working with embeddings in memory would become too slow, and a vector database would provide much faster similarity search, efficient indexing, filtering, and general robustness that would be beneficial for a larger project.
+
+As it is, the code is clean, maintainable, and follows Kedro best practices. More importantly, it works.
The bot can answer questions about Cyberpunk 2077, drawing from both the game transcript and comprehensive wiki data.
 And after 466 hours of gameplay and every achievement unlocked, I can confirm: the bot's answers are accurate. Now, if only it could tell me when the sequel's release date is going to be…🤔
 Learn more about Kedro
 Kedro is an open-source Python toolbox that applies software engineering principles to data engineering and data science code. It reduces the time spent rewriting data science experiments so that they are fit for production.
 Kedro was born at QuantumBlack to solve the challenges faced regularly in data science projects and promote teamwork through standardized team workflows. It is now hosted by the LF AI & Data Foundation as an incubating project.
 If you want to dive deeper into Kedro or explore similar concepts:
 Kedro Documentation: The official docs for Kedro, covering everything from tutorials to advanced usage.
 Kedro GitHub Repository: Source code, issues, and discussions.
 Kedro YouTube Channel: Guides, presentations, and bi-weekly livestreamed coffee chats with the Kedro team.
 Kedro Slack Community: Talk to other Kedro users and maintainers.
 
 Get the code
 The complete project is available on GitHub as part of the Kedro Academy repository. Feel free to explore, experiment, and adapt it for your own use cases.
 Feedback? Let us know in the comments below, or respond on GitHub.
+When experimenting with automated retrieval and prompting, be aware that it's possible to hit external rate limits. The OpenAI API may throttle you if you send too many simultaneous requests, and Discord enforces per-channel and per-user message limits that can get your bot temporarily blocked if you ignore them.
+
 ---
-Many thanks to Laura Couto for authoring, to Rashida Kanchwala and Joseph Perkins for review, and Jo Stichbury for editorial guidance.
\ No newline at end of file
+Many thanks to Laura Couto for authoring, to Rashida Kanchwala and Joseph Perkins for review, and Jo Stichbury for editorial guidance.

From 676cb7fdd2ee9d68810ccd7ae0896e674eec1db6 Mon Sep 17 00:00:00 2001
From: "L. R. Couto" <57910428+lrcouto@users.noreply.github.com>
Date: Thu, 11 Dec 2025 00:53:52 -0300
Subject: [PATCH 8/8] Fix run-on sentences

---
 kedro-cyberpunk-knowledge-base/post/medium-copy.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kedro-cyberpunk-knowledge-base/post/medium-copy.md b/kedro-cyberpunk-knowledge-base/post/medium-copy.md
index b16de54e..850fcb25 100644
--- a/kedro-cyberpunk-knowledge-base/post/medium-copy.md
+++ b/kedro-cyberpunk-knowledge-base/post/medium-copy.md
@@ -130,14 +130,14 @@ def find_relevant_contexts(
     ]
 This change made response quality significantly better. The model could now understand that "Who is Johnny Silverhand?" and "Tell me about the guy who blew up Arasaka Tower?" were asking for similar information, even without exact keyword matches.
 
-I've chosen to use SentenceTransformers to embed and retrieve information at the sentence level to try to get a balance between retrieving the part of the data that was most relevant and preserving meaning. Token-based or size-based chunking, like LangChain TextSplitter, risks slicing an idea arbitrarily, while long paragraphs can end up mixing unrelated concepts. Sentence level embeddings give me a chunk of context that's small enough to not contain unrelated ideas or that end up using too many tokens, while still preserving the meaning, which is particularly interesting in the case of a game like Cyberpunk 2077, that contains a lot of invented slang, pseudo-technological terms, and other fictional bits of language.
+I've chosen to use SentenceTransformers to embed and retrieve information at the sentence level to try to get a balance between retrieving the part of the data that was most relevant and preserving meaning. Token-based or size-based chunking, like LangChain TextSplitter, risks slicing an idea arbitrarily, while long paragraphs can end up mixing unrelated concepts.
Sentence-level embeddings give me a chunk of context that's small enough not to mix in unrelated ideas or use too many tokens, while still preserving the meaning. This is particularly interesting in the case of a game like Cyberpunk 2077, which contains a lot of invented slang, pseudo-technological terms, and other fictional bits of language.
 
 However, even with these improvements, the transcript alone was insufficient. Questions about characters worked reasonably well, but questions about game mechanics, missions, or world-building fell flat. The data simply wasn't there. I needed a better data source.
 The wiki
 I needed a source of data that was complete, up-to-date, reliable, and neutral because people get very passionate about their in-game choices. The community-maintained Cyberpunk Wiki was the obvious answer.
 The download challenge
 The wiki is substantial, as it's about 15,000 pages. Downloading it required the following:
-Respecting API rate limits: I had to add intentional pauses between requests to avoid getting blocked. The download script, which ran simple GET requests to the fandom.com API with `sleep()` calls for pauses, took a couple of hours to complete. Getting fresh data from the wiki would require the script to be run again manually, but since Cyberpunk 2077 is a completed game that receives very infrequent updates, I decided this was not an issue.
+Respecting API rate limits: I had to add intentional pauses between requests to avoid getting blocked. The download script, which ran simple GET requests to the fandom.com API with `sleep()` calls for pauses, took a couple of hours to complete. Getting fresh data from the wiki would require the script to be run again manually, but I decided this was not an issue since Cyberpunk 2077 is a completed game that receives very infrequent updates.
 
 Choosing the right format: I settled on a single JSON file where each page is a key-value pair:
 
@@ -190,7 +190,7 @@ This approach has trade-offs:
 ❌ Not scalable for very large datasets
 ❌ Entire dataset loads into memory
 
-For 13,000 pages (about 11 megabytes of data), I decided this was a perfectly reasonable trade-off. The embeddings load quickly, and the similarity search is fast enough for interactive use. It's easy to run locally, does not require any setup, and allowed me to iterate quickly on writing the pipeline. It would not be a suitable approach for production, as the pickled embeddings must be fully loaded into memory, and the PickleDataset does not offer the same range of features that a vector database would, like indexing, metadata filtering, or persistence in storage. But for a relatively small project, it demonstrates well enough that Kedro's built-in tools can handle non-trivial workloads.
+For 13,000 pages (about 11 megabytes of data), I decided this was a perfectly reasonable trade-off. The embeddings load quickly, and the similarity search is fast enough for interactive use. It's easy to run locally, does not require any setup, and allowed me to iterate quickly on writing the pipeline. It would not be a suitable approach for production, as the pickled embeddings must be fully loaded into memory. In addition, the PickleDataset does not offer the same range of features that a vector database would, like indexing, metadata filtering, or persistence in storage. But for a relatively small project, it demonstrates well enough that Kedro's built-in tools can handle non-trivial workloads.
 The results
 
@@ -363,7 +363,7 @@ It was a good opportunity to explore patterns that generalize well to production
 
 Kedro brings structure to this exploration. It enforces correct ordering of transformations, offers a configuration layer that keeps things like paths, parameters, and credentials easy to track, and formalizes I/O so artifacts are typed and testable. Nodes as pure Python functions give flexibility in writing code, while modular pipelines make the project easier to refactor when necessary.
 
-It was also nice to see how a simple PickleDataset worked very well for a demo. Were the project to be scaled to a larger level, for example, if I wanted to add hundreds of thousands of social media posts about Cyberpunk 2077 to my data sources, it would benefit from moving to a vector database. After a certain point, working with embeddings in memory would become too slow, and a vector database would provide much faster similarity search, efficient indexing, filtering, and general robustness that would be beneficial for a larger project.
+It was also nice to see how a simple PickleDataset worked very well for a demo. For example, if I wanted to add hundreds of thousands of social media posts about Cyberpunk 2077 to my data sources, the project would benefit from moving to a vector database. After a certain point, working with embeddings in memory would become too slow, and a vector database would provide much faster similarity search, efficient indexing, filtering, and general robustness that would be beneficial for a larger project.
 
 As it is, the code is clean, maintainable, and follows Kedro best practices. More importantly, it works. The bot can answer questions about Cyberpunk 2077, drawing from both the game transcript and comprehensive wiki data.
 And after 466 hours of gameplay and every achievement unlocked, I can confirm: the bot's answers are accurate. Now, if only it could tell me when the sequel's release date is going to be…🤔