
How to Create an LLM Without Wasting Six Months on the Wrong Architecture

Skip six months of wrong moves. Learn which LLM path actually fits your budget—build, fine-tune, or API—with the architecture and data decisions that separate shipped projects from expensive experiments.

Rohan Mehta
June 25, 2026 · 10 min read · 1,207 views
Key takeaways

What you'll learn in 10 minutes

  • Should You Build, Fine-Tune, or Use an Existing LLM?
  • What Does an LLM Actually Need to Run?
  • How Do You Collect and Prepare Training Data?
  • How Do You Actually Train the Model?
  • How Do You Evaluate Whether Your LLM Is Actually Working?

TL;DR: Most guides on how to create an LLM explain the theory and skip the decisions that actually cost you time and money. This one walks IT company owners through architecture choices, data pipeline tradeoffs, and compute budgeting with enough specificity to know whether building makes sense for your situation. You'll finish with a clear framework for the build-vs-buy call.

Should You Build, Fine-Tune, or Use an Existing LLM?

The decision you make here determines your timeline, budget, and whether the project ships at all.

Training from scratch means building a large language model on your own dataset, your own compute, and your own architecture choices. For a 7B parameter model, you're looking at well over 100,000 GPU hours on A100s (Meta reports roughly 184,000 for Llama 2 7B) and project bills that routinely run $500K–$2M before you have anything usable. Unless your business has a genuinely unique language domain that no existing model has seen, this path is almost never the right call for an IT company.

Fine-tuning a base model is where most custom business LLM projects actually belong. You take an open-source foundation, such as Mistral 7B or Meta's Llama 2, and train it further on your domain-specific data. Fine-tuning a 7B model on a focused dataset costs roughly $50–$500 in cloud compute, depending on dataset size and the method you use. LoRA (Low-Rank Adaptation) cuts that further by freezing the base weights and training only small low-rank adapter matrices. If you need a model that understands your internal ticketing language, your client contracts, or your product documentation, fine-tune an LLM rather than building one from zero.
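To make the LoRA saving concrete, here is a rough sketch of the parameter arithmetic for one weight matrix. The 4096 x 4096 shape and rank 8 are illustrative assumptions, not figures taken from any specific model card, and the function names are hypothetical.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters updated when the whole weight matrix is fine-tuned."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """LoRA trains two small matrices, B (d_out x rank) and A (rank x d_in).

    The original d_out x d_in matrix stays frozen.
    """
    return rank * (d_out + d_in)


# Assumed example shape: one 4096 x 4096 projection with LoRA rank 8.
full = full_params(4096, 4096)
lora = lora_trainable_params(4096, 4096, rank=8)
print(full)         # 16777216
print(lora)         # 65536
print(lora / full)  # 0.00390625
```

Under these assumptions the adapter holds well under 1% of the matrix's parameters, which is where the lower memory and compute bill comes from.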

API integration is the fastest path and the right default for most IT owners. Connect to OpenAI, Anthropic, or Google's Gemini via API, add a retrieval layer over your own data, and you have a working system in days, not months. The tradeoff is data privacy and ongoing per-token cost.

The honest framework for how to create an LLM that ships: start with API integration, move to fine-tuning if you hit domain accuracy limits, and treat training from scratch as a last resort. It is also worth understanding how LLMs surface in AI-driven search before you commit to a custom build.

What Does an LLM Actually Need to Run?

Before you write a single line of training code, you need to know what the hardware bill will look like. Most tutorials explain transformer architecture in theory and skip the part where you actually have to pay for it.

An LLM runs on three things: compute, memory, and storage. Compute means GPUs or TPUs. A 7B parameter model trained from scratch needs roughly 100,000–200,000 GPU hours on A100-class hardware. For a single clean run, that translates to $150,000–$500,000 in raw compute on AWS or GCP, depending on spot pricing and region; failed runs and experiments come on top. Memory is the constraint most teams hit first: a 7B model in FP16 precision needs around 14GB of VRAM just to load, before you account for gradients and optimizer states during training, which can push that to 80GB+. Storage matters too: Llama 2 was trained on 2 trillion tokens, and a corpus of 1–2 trillion tokens means several terabytes of cleaned, formatted text before you start.
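The memory and compute figures above can be sanity-checked with back-of-envelope arithmetic. This is a sketch under stated assumptions: 16 bytes per parameter is a common rule of thumb for mixed-precision AdamW training (FP16 weights and gradients, FP32 master weights, two FP32 optimizer moments) and excludes activations, and the $1.50 and $2.50 hourly rates are hypothetical values chosen only to bracket the cost range quoted above.

```python
GB = 1e9  # decimal gigabytes


def weights_memory_gb(n_params: int, bytes_per_param: int = 2) -> float:
    """Memory to hold the weights alone (2 bytes per parameter in FP16)."""
    return n_params * bytes_per_param / GB


def training_state_memory_gb(n_params: int, bytes_per_param: int = 16) -> float:
    """Assumed rule of thumb: weights + gradients + AdamW state, no activations."""
    return n_params * bytes_per_param / GB


def compute_cost_usd(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Raw compute cost for one run at a flat hourly rate."""
    return gpu_hours * usd_per_gpu_hour


n = 7_000_000_000
print(weights_memory_gb(n))               # 14.0
print(training_state_memory_gb(n))        # 112.0
print(compute_cost_usd(100_000, 1.50))    # 150000.0
print(compute_cost_usd(200_000, 2.50))    # 500000.0
```

The training-state number exceeds a single 80GB card, which is why full training of a 7B model is sharded across several GPUs.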

The LLM architecture choice (decoder-only transformer being the current standard for generative tasks) determines how these costs scale. Wider models need more memory per layer; deeper models need more sequential compute.

If you're mapping the business processes your LLM will automate before you train it, these numbers are what turn a vague project scope into a real budget conversation with your CFO.

How Do You Collect and Prepare Training Data?

Data quality determines model quality more directly than architecture does. A well-structured transformer trained on dirty, duplicated, or poorly formatted text will produce confident-sounding nonsense. Most teams discover this six months in, not six weeks.

LLM training data typically needs to hit a minimum of 1 trillion tokens to train a general-purpose model from scratch. Llama 2 used 2 trillion tokens; Mistral has not published a token count for Mistral 7B. If you're building something narrower, a domain-specific model can work with far less, but the quality bar gets higher as the volume drops. Before you train an LLM from scratch, mapping the business processes your LLM will automate tells you exactly which data sources actually matter.

Data preparation breaks into four concrete steps:

  1. Source selection: Pull from sources that match your target domain. Common inputs include web crawls, internal documents, licensed datasets, and structured databases. Web data needs aggressive filtering.

  2. Deduplication: Near-duplicate content inflates apparent dataset size and causes the model to overfit to repeated patterns. MinHash or exact-match deduplication catches most of it.

  3. Cleaning: Strip HTML artifacts, boilerplate, personally identifiable information, and low-quality text. A perplexity filter (scoring text against a small reference model) removes the worst outliers efficiently.

  4. Formatting: Tokenizer-specific formatting matters. Inconsistent whitespace, encoding mismatches, and mixed languages create subtle training instabilities that are hard to trace back to the data.
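The deduplication step can be sketched with the standard library alone. This is a minimal illustration, not a production pipeline: it assumes word-level 3-shingles, 64 hash functions simulated by salting SHA-256 with a seed, and whitespace/case normalization, all of which are choices made for this example.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_dedupe(docs):
    """Keep the first copy of each document, comparing normalized hashes."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text: str, n: int = 3):
    """Set of overlapping n-word windows; short texts become one shingle."""
    words = normalize(text).split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text: str, num_hashes: int = 64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    items = shingles(text)
    return [
        min(
            int.from_bytes(
                hashlib.sha256(f"{seed}:{s}".encode("utf-8")).digest()[:8], "big"
            )
            for s in items
        )
        for seed in range(num_hashes)
    ]


def estimated_jaccard(sig_a, sig_b) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


a = "the quick brown fox jumps over the lazy dog near the river bank"
b = "the quick brown fox jumps over the lazy dog near the river edge"
c = "quarterly invoices must be approved by finance before any payment is released"
print(estimated_jaccard(minhash_signature(a), minhash_signature(b)))
print(estimated_jaccard(minhash_signature(a), minhash_signature(c)))
```

In practice you would pick a similarity threshold, drop one document from each pair above it, and use an index such as locality-sensitive hashing rather than comparing every pair.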

Bad data doesn't just reduce accuracy. It bakes in biases and hallucination patterns that persist through fine-tuning. Catching problems at the data stage costs hours. Catching them post-training costs months. Mapping the workflows your model will touch before you collect data is the fastest way to avoid sourcing the wrong inputs entirely.

How Do You Actually Train the Model?

Training a model is where abstract architecture decisions become real compute bills. The sequence matters, and so does the order in which you validate each stage.

Tokenization comes first. Your raw text gets converted into integer sequences using a tokenizer. Byte Pair Encoding (BPE) and unigram models are the two most common algorithms, and SentencePiece is a widely used library that implements both. The tokenizer vocabulary size (typically 32K–64K tokens) directly affects model size and inference speed, so set it before you touch the training loop.
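For intuition on how BPE builds a vocabulary, here is a toy sketch of the merge procedure on a four-word corpus. It is a teaching example under simplifying assumptions (character-level start, no byte fallback, no end-of-word markers) and is not how SentencePiece or any production tokenizer is implemented; the function names are hypothetical.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return the most common adjacent symbol pair, or None if there is none."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges, starting from single characters."""
    sequences = [list(word) for word in words]
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences)
        if pair is None:
            break
        merges.append(pair)
        sequences = [merge_pair(seq, pair) for seq in sequences]
    return merges, sequences


merges, sequences = train_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(sequences)  # [['low'], ['low'], ['low', 'e', 'r'], ['low', 'e', 's', 't']]
```

Each merge adds one entry to the vocabulary, so the vocabulary size you choose is effectively the number of merges you allow on top of the base symbols.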

Batching and the forward pass come next. Each batch of token sequences runs through the transformer layers, generating predictions. The model compares those predictions against the actual next tokens using cross-entropy loss. That loss number is your signal: if it plateaus early, suspect the learning rate first; if it spikes, check for formatting noise that slipped through cleaning or a learning rate that is set too high.

Checkpointing is the step most teams skip until they lose a run. Save model weights and optimizer state every few thousand steps to cloud storage. Training a 7B-parameter model from scratch requires roughly 100,000–200,000 GPU hours on A100s, which translates to $150,000–$500,000 on AWS or GCP depending on spot pricing. A crashed run without checkpoints means starting over.
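A minimal checkpoint-and-resume sketch looks like the following. It assumes a local directory and JSON as stand-ins for cloud storage and a framework's own serialization format; the file naming scheme and function names are hypothetical. The write goes to a temporary file first so a crash mid-write cannot leave a corrupt checkpoint behind.

```python
import json
import os
import re
import tempfile


def save_checkpoint(ckpt_dir: str, step: int, state: dict) -> str:
    """Write a checkpoint atomically: temp file first, then rename into place."""
    os.makedirs(ckpt_dir, exist_ok=True)
    final_path = os.path.join(ckpt_dir, f"step_{step:08d}.json")
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump({"step": step, "state": state}, handle)
    os.replace(tmp_path, final_path)
    return final_path


def load_latest_checkpoint(ckpt_dir: str):
    """Return the highest-step checkpoint, or None if nothing has been saved."""
    if not os.path.isdir(ckpt_dir):
        return None
    names = sorted(
        name for name in os.listdir(ckpt_dir)
        if re.fullmatch(r"step_\d{8}\.json", name)
    )
    if not names:
        return None
    with open(os.path.join(ckpt_dir, names[-1])) as handle:
        return json.load(handle)


with tempfile.TemporaryDirectory() as demo_dir:
    save_checkpoint(demo_dir, 1000, {"weights": [0.1, 0.2]})
    save_checkpoint(demo_dir, 2000, {"weights": [0.3, 0.4]})
    resumed = load_latest_checkpoint(demo_dir)
    print(resumed["step"])  # 2000
```

On restart, a training script would call the load function first and continue from the returned step instead of step zero.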

The full loop, in order:

  1. Tokenize and shard your dataset across GPU workers

  2. Run forward pass, compute cross-entropy loss

  3. Backpropagate gradients, update weights via AdamW optimizer

  4. Log loss, learning rate, and gradient norm per step

  5. Checkpoint weights at regular intervals
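The loop above can be sketched end to end on a toy problem. This example swaps the transformer for a tiny bigram model (one row of logits per previous token) so it runs in pure Python; that substitution, the hyperparameter values, and the in-memory checkpoint dictionary are all assumptions made for illustration. Sharding across GPU workers is omitted.

```python
import math


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bigram(tokens, vocab_size, steps=200, lr=0.1, weight_decay=0.01,
                 beta1=0.9, beta2=0.999, eps=1e-8, checkpoint_every=50):
    pairs = list(zip(tokens, tokens[1:]))  # (current token, next token)

    def zeros():
        return [[0.0] * vocab_size for _ in range(vocab_size)]

    weights, m, v = zeros(), zeros(), zeros()
    history, checkpoints = [], {}
    for step in range(1, steps + 1):
        grad, loss = zeros(), 0.0
        # Forward pass and mean cross-entropy loss over the whole tiny batch.
        for cur, nxt in pairs:
            probs = softmax(weights[cur])
            loss -= math.log(probs[nxt])
            # Gradient of cross-entropy w.r.t. logits: probs minus one-hot target.
            for j, p in enumerate(probs):
                grad[cur][j] += (p - (1.0 if j == nxt else 0.0)) / len(pairs)
        loss /= len(pairs)
        grad_norm = math.sqrt(sum(g * g for row in grad for g in row))
        # AdamW update: bias-corrected Adam step plus decoupled weight decay.
        for i in range(vocab_size):
            for j in range(vocab_size):
                g = grad[i][j]
                m[i][j] = beta1 * m[i][j] + (1 - beta1) * g
                v[i][j] = beta2 * v[i][j] + (1 - beta2) * g * g
                m_hat = m[i][j] / (1 - beta1 ** step)
                v_hat = v[i][j] / (1 - beta2 ** step)
                weights[i][j] -= lr * (m_hat / (math.sqrt(v_hat) + eps)
                                       + weight_decay * weights[i][j])
        # Log loss, learning rate, and gradient norm for every step.
        history.append({"step": step, "loss": loss, "lr": lr,
                        "grad_norm": grad_norm})
        # Checkpoint a copy of the weights at regular intervals.
        if step % checkpoint_every == 0:
            checkpoints[step] = [row[:] for row in weights]
    return weights, history, checkpoints


tokens = [0, 1, 2] * 20  # a perfectly predictable toy sequence
weights, history, checkpoints = train_bigram(tokens, vocab_size=3)
print(history[0]["loss"], history[-1]["loss"])
print(sorted(checkpoints))  # [50, 100, 150, 200]
```

The first logged loss equals ln(3), the loss of a uniform guess over three tokens, and it falls as the model learns the repeating pattern. A real run replaces the bigram table with a transformer and the nested loops with a framework's optimizer, but the order of operations is the same.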

If you've already mapped the business processes your LLM will automate before this stage, your loss targets have real business meaning — not just benchmark scores.

How Do You Evaluate Whether Your LLM Is Actually Working?

Evaluation is where most custom LLM projects stall. Teams ship a model, run a few informal tests, and declare it "good enough" — until it hallucinates a client's contract terms or returns gibberish on edge-case queries.

Four metrics actually tell you something useful:

  • Perplexity measures how surprised your model is by held-out text. Lower is better. A perplexity above 20 on domain-specific validation data usually means your training corpus was too thin or too generic.

  • BLEU score measures n-gram overlap between model output and reference answers. Useful for structured tasks like summarization or translation. Less useful for open-ended generation, where a low BLEU score can still mean a good answer.

  • Hallucination rate is the one most tutorials skip. Track it manually on a 50–100 sample eval set: count how often the model asserts a false fact confidently. If you're building a custom LLM for business use, anything above 5% in a high-stakes domain is a deployment blocker.

  • Task-specific benchmarks matter more than generic ones. If you fine-tune an LLM on IT service tickets, your benchmark should be ticket resolution accuracy, not MMLU.
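Two of these metrics reduce to a few lines of arithmetic. The sketch below assumes you already have per-token natural-log probabilities from your model on held-out text, and a list of manual review labels where True means the model confidently asserted something false. The function names and the 60-sample example are hypothetical; the 5% threshold is the one quoted above.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood over held-out tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def hallucination_rate(review_labels):
    """Share of reviewed samples flagged as confident false assertions."""
    if not review_labels:
        raise ValueError("need at least one reviewed sample")
    return sum(review_labels) / len(review_labels)


def is_deployment_blocker(rate, threshold=0.05):
    """Apply the 'anything above 5%' rule for high-stakes domains."""
    return rate > threshold


# A model that gives every correct token probability 0.25 has perplexity 4.
print(perplexity([math.log(0.25)] * 4))

# Assumed eval set: 60 reviewed answers, 4 flagged as hallucinations.
labels = [True] * 4 + [False] * 56
rate = hallucination_rate(labels)
print(rate, is_deployment_blocker(rate))
```

Perplexity values are only comparable between models that share a tokenizer, since the number of tokens per sentence changes the average.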

When numbers are bad, the fix is usually upstream: more domain data, tighter prompt formatting, or a longer fine-tuning run. Before you reach deployment, it also helps to have mapped the business processes your LLM will automate — that map becomes your eval criteria.

How Do You Deploy an LLM Into a Real Business Workflow?

Evaluation metrics tell you when your model is ready. Deployment is where it either earns its keep or disappears into a Notion doc.

For most IT owners, LLM deployment means wrapping your model in a REST API, then connecting that API to the tools your team already uses — your CRM, your ticketing system, your internal knowledge base. Zapier handles simple trigger-to-action routing. For anything stateful, you'll need a lightweight orchestration layer like LangChain or a custom FastAPI service.
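As a rough sketch of the "wrap your model in a REST API" step, the example below uses only Python's standard library rather than FastAPI, so it runs without extra dependencies. The generate function is a hypothetical placeholder for your real inference call, and the JSON shape (a "prompt" field in, a "completion" field out) is an assumption for this example, not an established contract.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def generate(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; swap in your inference client."""
    return f"[model output for: {prompt}]"


def handle_payload(raw_body: bytes):
    """Validate the request body and return (status_code, response_dict)."""
    try:
        payload = json.loads(raw_body.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return 400, {"error": "body must be valid JSON"}
    prompt = payload.get("prompt") if isinstance(payload, dict) else None
    if not isinstance(prompt, str) or not prompt.strip():
        return 400, {"error": "missing 'prompt' string"}
    return 200, {"completion": generate(prompt)}


class CompletionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, body = handle_payload(self.rfile.read(length))
        data = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port: int = 8000):
    """Start a blocking local server; call this from your own entry point."""
    HTTPServer(("127.0.0.1", port), CompletionHandler).serve_forever()


print(handle_payload(b'{"prompt": "reset my VPN"}'))
```

Keeping validation in a plain function separate from the HTTP handler makes it testable without starting a server, and the same function can sit behind FastAPI or a Zapier webhook later.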

Latency and cost are the two numbers that determine whether a custom business LLM actually sticks. A 7B parameter model served on a single A100 GPU typically returns responses in 200–400ms, which is acceptable for async workflows and borderline for live chat. Cost-per-query on AWS or GCP for that same model runs roughly $0.002–$0.006 depending on token length and batching strategy. If your query volume is low, a fine-tuned model behind a managed inference endpoint usually beats hosting your own.
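The volume point is easiest to see with arithmetic. This sketch assumes a dedicated GPU billed at a flat hourly rate whether or not it is busy; the $2.00 per hour price and the query volumes are hypothetical values picked for illustration.

```python
def self_hosted_cost_per_query(gpu_usd_per_hour: float, queries_per_hour: float) -> float:
    """Spread the fixed hourly GPU bill across the queries actually served."""
    if queries_per_hour <= 0:
        raise ValueError("queries_per_hour must be positive")
    return gpu_usd_per_hour / queries_per_hour


# Assumed price: $2.00 per GPU hour for an always-on instance.
for volume in (1000, 500, 50):
    print(volume, round(self_hosted_cost_per_query(2.00, volume), 4))
# 1000 0.002
# 500 0.004
# 50 0.04
```

At 50 queries an hour the same hardware costs ten to twenty times more per query than at full utilization, which is the case where a pay-per-use endpoint tends to win.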

Before you wire anything up, map the business processes your LLM will automate, specifically which inputs trigger the model and where the output goes next. Skipping this step is why most deployments stall after the demo. You can also map the workflows your model will touch to catch handoff gaps before they become production bugs.

What Breaks When You Create an LLM and How Do You Fix It?

Four failure modes account for most wasted months when you create an LLM.

Poor generalization usually means your LLM training data is too small or too homogeneous. A 7B-parameter model needs roughly 1–2 trillion tokens to generalize well; Llama 2 7B was trained on 2 trillion. If you're working with a domain-specific corpus of a few million tokens, fine-tune an LLM instead of training from scratch.

Catastrophic forgetting hits during fine-tuning when new task data overwrites general reasoning. Reduce it with LoRA or QLoRA, which keep the base weights frozen and train small adapter matrices instead of the full parameter set.

Tokenization errors surface when your corpus contains domain jargon the base tokenizer never saw. Audit token fertility rates before training starts, not after.
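Token fertility is simply the average number of tokens produced per word, so an audit can be a few lines. In this sketch the tokenizer is passed in as a function; toy_tokenize is a hypothetical stand-in that chops words into 4-character pieces, so replace it with your base model's real tokenizer when you run the audit.

```python
def token_fertility(texts, tokenize) -> float:
    """Average tokens per whitespace-separated word across a sample of texts."""
    total_words = sum(len(text.split()) for text in texts)
    total_tokens = sum(len(tokenize(text)) for text in texts)
    if total_words == 0:
        raise ValueError("sample contains no words")
    return total_tokens / total_words


def toy_tokenize(text: str, chunk: int = 4):
    """Hypothetical tokenizer for the demo: split each word into 4-char pieces."""
    return [
        word[i:i + chunk]
        for word in text.split()
        for i in range(0, len(word), chunk)
    ]


general = ["reset the vpn"]
jargon = ["kubernetes autoscaler"]
print(token_fertility(general, toy_tokenize))  # 4 tokens / 3 words
print(token_fertility(jargon, toy_tokenize))   # 3.0
```

A fertility on your domain text that is much higher than on general text suggests the tokenizer is fragmenting your jargon, which costs context length and can hurt accuracy.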

Context window overflow at inference time has to be solved by an architecture choice, not a runtime patch. Before you build, map the business processes your model will automate. That tells you the minimum context length you actually need, and prevents over-engineering a 128K window for a task that needs 4K.
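One way to turn that process map into a number is to measure token counts on real sample inputs and size the window to a high percentile plus room for the answer. This is a sketch under assumptions: the 95th percentile, the 512-token output budget, and the sample counts are illustrative choices, and the function name is hypothetical.

```python
def required_context_length(token_counts, percentile: int = 95,
                            output_budget: int = 512) -> int:
    """Nearest-rank percentile of input lengths plus room for the response."""
    if not token_counts:
        raise ValueError("need at least one sample")
    ordered = sorted(token_counts)
    # Integer ceiling of percentile% of the sample size (nearest-rank method).
    rank = max(1, (percentile * len(ordered) + 99) // 100)
    return ordered[rank - 1] + output_budget


# Assumed sample: token counts for 20 representative tickets, 100 to 2000.
sample = list(range(100, 2100, 100))
print(required_context_length(sample))  # 2412
```

With this sample a 4K window is comfortably enough, so a 128K window would add cost without covering any additional requests.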

Closing

The build-vs-buy decision hinges on one question: does your business have a language domain so specific that no existing model handles it well enough? For most IT companies, the answer is no. Start with API integration, measure where it falls short, then move to fine-tuning if the accuracy gap justifies the cost and complexity. Training from scratch almost never makes sense unless you're operating at scale with a genuinely unique dataset and a dedicated ML team.

Once your LLM is deployed and running, the next problem surfaces immediately: connecting its output to the rest of your business without building custom glue code for every integration. That's where Revo handles the automation layer, routing model outputs to your ticketing system, CRM, or internal tools without manual handoffs. And as your model drifts or you iterate on training data, Taro keeps that ongoing work tracked and assigned across your team. Start by auditing whether you actually need to build, then wire it into your workflow stack.

FAQ

How long does it take to create an LLM from scratch?

Training a 7B parameter model from scratch requires roughly 100,000–200,000 GPU hours on A100s, which translates to weeks of continuous compute even on a cluster of several hundred GPUs. Fine-tuning an existing model takes days to weeks. API integration works in days.

How much does it cost to train a large language model?

Training from scratch: $500K–$2M. Fine-tuning a 7B model: $50–$500 depending on dataset size and method. API integration: per-token pricing, typically $0.01–$0.10 per 1K tokens.

Can a small IT team build an LLM without a dedicated ML engineer?

Not from scratch—that requires specialized infrastructure and deep training expertise. Fine-tuning or API integration is feasible with a generalist engineer and cloud documentation, though you'll hit edge cases faster.

What is the difference between training an LLM and fine-tuning one?

Training from scratch builds a model on your own dataset and compute from zero, costing $500K–$2M. Fine-tuning takes an existing open-source model and trains it further on your domain-specific data, costing $50–$500.

How much data do you need to train an LLM?

General-purpose models need 1–2 trillion tokens. Domain-specific models can work with less, but quality must rise as volume drops. Start by mapping which data sources actually matter to your business.


Rohan Mehta

Rohan Mehta is a Startup Operations Advisor & Product Builder who has scaled operations teams at three early-stage companies from seed to Series A. He writes about building lean ops infrastructure, making the right hiring decisions for operational roles, and the systems choices that either unlock growth or quietly hold it back.