Training large language models has moved from a laboratory discipline into the mainstream of enterprise solution development. Developers today face not only technical challenges but also new legal boundaries, an expanding open-source ecosystem, and growing pressure for transparency in data provenance. This article offers an overview of what training an AI model entails today, with references to recent cases, court decisions, and technological developments.
Training Pipeline: What It Consists Of
Training a large language model (LLM) today typically involves these phases:
| Phase | Description |
|---|---|
| Data Collection | Gathering extensive text corpora (web, books, code, documentation) |
| Filtering and Tokenization | Removing noise and duplicates; converting text into token sequences for model input |
| Pre-training | Self-supervised learning of language structure, typically via next-token prediction |
| Fine-tuning | Adapting the model to a specific domain or communication style |
| Alignment | Adjusting outputs using human feedback (e.g., RLHF) |
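To make the first three phases concrete, here is a minimal sketch of the data-side steps (filtering, tokenization, and a causal pre-training run) using the Hugging Face transformers and datasets libraries. The GPT-2 checkpoint, the corpus.txt file, and all hyperparameters are illustrative placeholders, not a production recipe.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Data collection stand-in: a local plain-text corpus (placeholder path).
raw = load_dataset("text", data_files={"train": "corpus.txt"})["train"]

# Filtering: drop very short lines and exact duplicates. Real pipelines add
# near-duplicate detection (e.g., MinHash), language ID, and quality filters.
seen = set()
def keep(example):
    text = example["text"].strip()
    if len(text) < 32 or text in seen:
        return False
    seen.add(text)
    return True

filtered = raw.filter(keep)

# Tokenization: map text to token ID sequences of bounded length.
tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default
def tokenize(batch):
    return tok(batch["text"], truncation=True, max_length=512)

tokenized = filtered.map(tokenize, batched=True, remove_columns=["text"])

# Pre-training: causal language modeling, i.e., next-token prediction.
model = AutoModelForCausalLM.from_pretrained("gpt2")
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ckpt",
                           per_device_train_batch_size=8,
                           num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()
```

At real scale, the same loop is sharded across many accelerators and the exact-duplicate filter gives way to statistical deduplication, but the sequence of steps is the same.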
Architecturally, most teams today use efficient training frameworks such as DeepSpeed or Axolotl (with vLLM typically handling inference and serving rather than training), or are moving to smaller specialized models such as Mistral or Phi-3.
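As an illustration of the fine-tuning phase on such smaller models, the sketch below applies parameter-efficient LoRA adapters with the peft library. The Mistral checkpoint and the LoRA hyperparameters are assumptions chosen for the example; tools like Axolotl wrap essentially this setup in a declarative config.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder base checkpoint; any causal LM with q_proj/v_proj layers works.
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# LoRA: train small low-rank adapter matrices instead of all model weights.
lora = LoraConfig(
    r=16,                                 # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of all weights
```

The adapted model can then be trained with the same Trainer loop shown above, which is what makes domain fine-tuning feasible on a single GPU.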
What Data Can Be Used for Training and Why It Matters
The choice of training data directly affects both the model's performance and its legal standing. In June 2025, a U.S. federal court ruled in favor of Meta, which had used pirated copies of books from the Library Genesis database to train its LLaMA models. According to Judge Vince Chhabria, this constituted transformative fair use, because the model did not reproduce the texts verbatim but used them to learn language patterns (The Verge, 2025).
At the same time, Anthropic successfully defended training its Claude model on legally purchased physical books that it had scanned. Judge William Alsup compared AI training to the way humans learn to read and write: reading is not copying, but learning the underlying principles (AP News, 2025).