Apertia.ai
Training AI Models in 2025: Data, Law, and the Technological Leap Forward
Artificial Intelligence | July 4, 2025 | 7 min


Apertia Team

Training large language models has moved from a laboratory discipline into the mainstream of enterprise solution development. Developers today face not only technical challenges but also new legal boundaries, the expansion of open-source architectures, and pressure for transparency in data provenance. This article offers a comprehensive overview of what training an AI model entails today, with references to current cases, decisions, and technological developments.

Training Pipeline: What It Consists Of

Training a large language model (LLM) today typically involves these phases:

  • Data Collection: Gathering extensive text corpora (web, books, code, documentation)
  • Filtering and Tokenization: Removing noise and duplicates, then tokenizing text for neural network input
  • Pre-training: Statistical learning of language structure
  • Fine-tuning: Tuning the model for a specific domain or communication style
  • Alignment: Adjusting outputs using feedback (e.g., RLHF)
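The filtering and tokenization phases above can be sketched in a few lines. This is a toy illustration, not a production pipeline: real systems use fuzzy deduplication (e.g., MinHash) and trained subword tokenizers (BPE), not exact matching and whitespace splitting.

```python
# Toy sketch of the "Filtering and Tokenization" phases (illustrative only).

def deduplicate(docs):
    """Drop empty and exact-duplicate documents, keeping first occurrence."""
    seen, out = set(), []
    for doc in docs:
        key = doc.strip().lower()
        if key and key not in seen:
            seen.add(key)
            out.append(doc)
    return out

def tokenize(doc):
    """Naive whitespace tokenization; real models use subword (BPE) vocabularies."""
    return doc.split()

corpus = ["The cat sat.", "The cat sat.", "", "Dogs bark loudly."]
clean = deduplicate(corpus)        # ['The cat sat.', 'Dogs bark loudly.']
tokens = [tokenize(d) for d in clean]
print(tokens)
```

In real pipelines each of these steps is far heavier: deduplication runs at web scale across near-duplicates, and the tokenizer itself is trained on the corpus before pre-training begins.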

Architecturally, most teams today rely on efficient training frameworks such as DeepSpeed or Axolotl, pair them with inference engines like vLLM, or are moving to smaller specialized models (Mistral, Phi-3).

What Data Can Be Used for Training and Why It Matters

The choice of training data directly affects both the model's performance and legal standing. In June 2025, a U.S. federal court ruled in favor of Meta, which used pirated copies of books from the Library Genesis database when training the LLaMA model. According to Judge Vince Chhabria, this constituted transformative use (fair use), because the model did not copy texts verbatim but used them to learn language patterns (The Verge, 2025).

Simultaneously, Anthropic defended training its Claude model on scanned physical books purchased legally. Judge William Alsup compared AI training to the way humans learn to read and write: reading is not copying, but learning the principle (AP News, 2025).


Data Type Comparison:

  • Books. Advantages: language quality, deeper context. Risks: copyright (EU), limited domains.
  • Web articles. Advantages: scale, timeliness. Risks: bias, inaccuracy, duplication.
  • Code repositories. Advantages: formal syntax, functional examples. Risks: GPL/MIT licenses require verifying legal compatibility.
  • Proprietary company data. Advantages: domain relevance, know-how. Risks: GDPR, internal data governance, need for pseudonymization.
  • Synthetic data. Advantages: controlled content, tunable properties. Risks: potential bias transfer, limited creativity.

However, in the European Union fair use cannot be relied upon. There, the DSM Directive (2019/790) applies: it permits the use of data for text and data mining, but only if the rights holder has not actively reserved their works against such analysis. This means that developers in the EU must document the legal origin of their training data.
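One machine-readable way rights holders express this reservation is the W3C TDM Reservation Protocol draft, under which a server can send a `tdm-reservation` HTTP response header (`1` meaning rights reserved). The sketch below assumes that header name and semantics; check the current draft before relying on it in an ingestion pipeline.

```python
# Sketch: honoring a machine-readable TDM opt-out before ingesting a page.
# Assumption: the W3C TDM Reservation Protocol draft, where a response header
# "tdm-reservation: 1" signals that text-and-data-mining rights are reserved.

def tdm_allowed(headers: dict) -> bool:
    """Return False if the response headers signal a TDM reservation."""
    value = headers.get("tdm-reservation", "0").strip()
    return value != "1"

print(tdm_allowed({"tdm-reservation": "1"}))  # rights reserved: skip this source
print(tdm_allowed({}))                        # no reservation signaled
```

A real crawler would also check the site-level policy file the same draft defines, and log the decision so the legal provenance of every ingested document can be documented later.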

Technology Evolution: Growing Context and Specialization

Training today means not just teaching a model to "read texts," but also handling long context, efficient inference, tools, reasoning, and responsibility.

Evolution of Maximum AI Model Context (2018-2025):

While GPT-2 worked with a context of 1,024 tokens, today's models like Claude 4 or MiniMax-M1 can hold up to 1 million tokens. This enables:

  • Loading entire financial statements, contracts, CRM history.

  • Training agents with "memory" and planning capabilities.

  • Reducing the need for segmentation (chunking) in RAG systems.
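The practical question behind the last point is whether a document fits the context window at all. A minimal sketch, assuming the common rule of thumb of roughly 4 characters per token for English text (production code should count with the model's actual tokenizer):

```python
# Rough sketch: deciding whether a document fits a model's context window.
# The 4-chars-per-token ratio is a heuristic for English text, not exact.

def estimate_tokens(text: str) -> int:
    """Crude token estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

def fits_in_context(text: str, context_limit: int = 1_000_000) -> bool:
    """True if the document likely fits without chunking."""
    return estimate_tokens(text) <= context_limit

contract = "lorem ipsum " * 50_000          # ~600k characters of sample text
print(estimate_tokens(contract))            # ~150,000 estimated tokens
print(fits_in_context(contract))            # fits a 1M-token window whole
```

With a 1M-token window, a document of this size no longer needs to be split into retrieval chunks; with a 2,048-token window it would have required hundreds of segments.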
