ShellHounds: Rapid Tactical Prototyping Lab

A Division of Klavan Security

AI Data Collection & Value Assessment

Understanding what AI services collect, how your data is used, and what it's actually worth in the AI economy

AI Models & Data Collection Practices

Large language models (LLMs) like ChatGPT, Claude, and Google Gemini collect and process vast amounts of data through user interactions. Understanding what data is collected and how it's used is critical for personal and organizational security.

What AI Services Typically Collect

Most AI models collect all prompt inputs, generated outputs, conversation history, and metadata about user interactions. This data is often retained indefinitely unless specific data handling agreements are in place.

  • Prompt Inputs: Everything you type into the AI, including questions, requests, and any data you share
  • Generated Outputs: All responses created by the AI model
  • Conversation Context: The full history and flow of each interaction session
  • Interaction Metadata: Time, frequency, device information, usage patterns
  • Personal Identifiers: Account information, IP addresses, and potentially linked identities
  • Uploaded Files: Documents, images, and other media shared with the AI for analysis
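For teams that want to track this exposure systematically, the categories above can be captured in a lightweight inventory structure. The sketch below is illustrative only; the enum values, dataclass fields, and example entry are our own naming, not any vendor's schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class CollectedData(Enum):
    """Categories of data most AI services capture by default."""
    PROMPT_INPUTS = "prompt_inputs"
    GENERATED_OUTPUTS = "generated_outputs"
    CONVERSATION_CONTEXT = "conversation_context"
    INTERACTION_METADATA = "interaction_metadata"
    PERSONAL_IDENTIFIERS = "personal_identifiers"
    UPLOADED_FILES = "uploaded_files"


@dataclass
class ServiceUsageRecord:
    """One entry in an organizational AI data inventory (illustrative schema)."""
    service_name: str
    business_unit: str
    categories_shared: set[CollectedData] = field(default_factory=set)
    contains_regulated_data: bool = False  # PII, PHI, financial data, etc.


# Hypothetical example entry: a marketing team using a public chatbot
record = ServiceUsageRecord(
    service_name="Public chatbot (example)",
    business_unit="Marketing",
    categories_shared={CollectedData.PROMPT_INPUTS, CollectedData.UPLOADED_FILES},
)
print(record)
```

Even a simple inventory like this makes it easier to see which teams are sharing which categories of data, which feeds directly into the assessment methodology described later on this page.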

The Subscription Privacy Paradox

Many users assume paid subscriptions provide complete privacy protection. The reality is more nuanced:

  • Enterprise tiers typically offer the strongest protections, often with contractual terms preventing training on your data
  • Individual paid subscriptions may offer limited additional protections, but most still collect extensive data
  • Payment does not equal privacy - The primary benefits of paid tiers are often performance and features, not data protection
  • Terms of service matter more than subscription status - Always review the specific data handling policies

Major AI Services Comparison

Different AI services have varying approaches to data collection, storage, usage, and user rights. Understanding these differences is critical for making informed choices about which services to use and how to configure them.

| AI Service | Data Collection | Default Retention | Training Usage | Opt-Out Options | User Data Rights | Key Differences |
|---|---|---|---|---|---|---|
| ChatGPT Free | All conversations and inputs | Indefinite (30-day deletion option) | Yes - used to improve models | Basic - can opt out of training only | Limited - can delete history but data may persist | Most permissive data usage; lowest privacy protection |
| ChatGPT Plus | All conversations and inputs | Indefinite (30-day deletion option) | Yes - used to improve models | Basic - same opt-out options as free | Limited - same as free tier | Payment provides features, not significant privacy improvements |
| ChatGPT Enterprise | All conversations and inputs | Company-defined retention | No - not used for training by default | Advanced - business data not used for training | Enhanced - better deletion options, admin controls | Business data protected but still stored on OpenAI servers |
| Claude Free | All conversations and inputs | Indefinite | Yes - used to improve models | Limited - few opt-out options | Basic - can request data deletion | Somewhat less aggressive data collection than some competitors |
| Claude Pro | All conversations and inputs | Indefinite | Yes - used to improve models | Limited - same as free tier | Basic - same as free tier | Payment provides features, minimal privacy improvements |
| Claude Enterprise | All conversations and inputs | Configurable | No - not used for training by default | Advanced - can prevent training and data sharing | Enhanced - admin controls, configurable retention | Strong business data protections with custom agreements |
| Google Gemini Free | All conversations and account activity | 18 months+ (linked to Google Account) | Yes - used to improve Google services | Moderate - Google activity controls apply | Moderate - tied to overall Google data rights | Integrated with Google ecosystem; broader data collection |
| Google Gemini Advanced | All conversations and account activity | 18 months+ (linked to Google Account) | Yes - used to improve Google services | Moderate - same as free tier | Moderate - same as free tier | Payment provides features, not privacy improvements |
| Microsoft Copilot | All conversations and inputs | Indefinite by default | Yes - used to improve Microsoft services | Limited - few explicit AI-specific controls | Moderate - standard Microsoft data rights | Deeply integrated with Microsoft products |
| Copilot for Microsoft 365 | Work content, documents | Follows company retention policies | No - not used to train models | Advanced - commercial data protection commitments | Enhanced - admin controls, tenant isolation | Business data stays within tenant; strong protections |
| Open Source LLMs (Llama, Mistral, etc.) | Depends on implementation | Depends on implementation | No - self-hosted models don't send data back | Complete - when self-hosted | Complete - when self-hosted | Self-hosting provides complete control but requires technical expertise |
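Where acceptable-use tooling exists, the comparison above can be encoded as a simple lookup so that services whose default terms permit training on submitted inputs are flagged automatically. The sketch below mirrors the table; the dictionary keys and field names are our own, and the values should be re-verified against each vendor's current terms before relying on them.

```python
# Simplified policy lookup mirroring the comparison table above.
# "trains_on_data" reflects default behavior; re-check current vendor terms.
SERVICE_POLICIES = {
    "chatgpt_free":       {"trains_on_data": True,  "retention": "indefinite"},
    "chatgpt_enterprise": {"trains_on_data": False, "retention": "company-defined"},
    "claude_pro":         {"trains_on_data": True,  "retention": "indefinite"},
    "claude_enterprise":  {"trains_on_data": False, "retention": "configurable"},
    "gemini_free":        {"trains_on_data": True,  "retention": "18 months+"},
    "copilot_m365":       {"trains_on_data": False, "retention": "company policy"},
    "self_hosted_llm":    {"trains_on_data": False, "retention": "self-managed"},
}


def approved_for_sensitive_data(service: str) -> bool:
    """Coarse gate: only services that do not train on submitted data by
    default are candidates for sensitive or proprietary content."""
    policy = SERVICE_POLICIES.get(service)
    return bool(policy) and not policy["trains_on_data"]


for name in SERVICE_POLICIES:
    status = "allowed" if approved_for_sensitive_data(name) else "blocked"
    print(f"{name}: sensitive data {status}")
```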

The Economic Value of Your AI Data

Data provided to AI systems has significant economic value to AI companies. This data is used to improve models, create new products, and generate competitive advantages. Understanding this value helps organizations make informed decisions about AI usage policies.

Data Value Economics

The estimated values below represent the approximate economic worth of user data to AI companies. All financial figures are in USD.

Model Training Value

Data used to train and fine-tune AI models can be worth $0.10-$50 per interaction depending on uniqueness and domain expertise.

Product Development Value

User interactions inform new features and products, with specialized industry data potentially worth $100+ per conversation.

Competitor Intelligence

Proprietary information shared with AI systems could provide competitive insights worth thousands to competitors.

Market Research Value

User behavior patterns and industry-specific prompts inform AI company strategy and can be monetized through specialized AI offerings.

| Data Type | Value Indicator | Est. Value to AI Company (USD) |
|---|---|---|
| Generic queries/conversations | Low | $0.01-$1 per interaction |
| Domain expertise (legal, medical, etc.) | Medium | $1-$20 per interaction |
| Proprietary business processes | High | $20-$100+ per interaction |
| Internal strategic information | High | $100-$10,000+ (competitive value) |
| Customer/PII data | High | $1-$100 per record (compliance risk) |
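To make these ranges concrete, a back-of-the-envelope estimate can multiply an organization's interaction volumes by the per-interaction figures above. The interaction counts in the sketch below are hypothetical, and the ranges are the approximations from the table, not vendor-disclosed figures.

```python
# Rough annual value estimate using the per-interaction ranges above (USD).
# Interaction counts are hypothetical; ranges are approximations, not vendor figures.
VALUE_RANGES = {
    "generic":          (0.01, 1.0),
    "domain_expertise": (1.0, 20.0),
    "proprietary":      (20.0, 100.0),
}

annual_interactions = {
    "generic": 50_000,          # routine queries across the org (assumed)
    "domain_expertise": 5_000,  # legal/medical/engineering prompts (assumed)
    "proprietary": 500,         # prompts containing internal processes (assumed)
}

low = sum(VALUE_RANGES[k][0] * n for k, n in annual_interactions.items())
high = sum(VALUE_RANGES[k][1] * n for k, n in annual_interactions.items())
print(f"Estimated annual value of shared data to the provider: ${low:,.0f} - ${high:,.0f}")
```

With these assumed volumes the estimate lands roughly between $15,500 and $200,000 per year, which is usually enough to justify a formal review of what is being shared and with which services.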

Paid Subscriptions vs. Free Tiers

While paid subscriptions generally offer improved privacy terms, the economic value of data from paid users is often higher due to:

  • Higher quality inputs - Paid users tend to share more complete, thoughtful, and specialized information
  • Professional context - Business users often include proprietary processes and domain expertise in their interactions
  • More consistent usage - Paid users typically engage more deeply and frequently with the service
  • Enterprise data may be protected from training but still provides valuable market insights to AI providers

Sources: Value estimates based on research on data valuation in machine learning, Accenture's AI Data Economy report, and Microsoft Research on AI economics. All values are approximations as actual monetization methodologies are proprietary.

User Profiles: Value & Risk Assessment

Different types of AI users face varying levels of risk and provide different amounts of value to AI companies based on their usage patterns and subscription status. The assessment distinguishes four user profiles (all monetary values in USD):

  • Enterprise User
  • Business User
  • Premium Personal
  • Casual Personal

Privacy & Security Risks

Sharing information with AI services creates various privacy and security risks that organizations should assess and mitigate. Understanding these risks helps implement appropriate safeguards.

  • Data Leakage: Sensitive information shared in prompts may be exposed through model outputs to other users
  • Intellectual Property Exposure: Proprietary information may be incorporated into models that competitors can access
  • Compliance Violations: Sharing PII, PHI, financial data, or other regulated information with AI services may violate regulatory and data governance requirements
  • Shadow AI: Employees using unauthorized AI services with corporate data creates unmonitored risk
  • Training Data Extraction: Adversaries can potentially recover information that was included in a model's training data through targeted queries
  • Prompt Injection: Malicious content embedded in prompts or retrieved documents can manipulate AI behavior, including coaxing the system into exposing sensitive information
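One practical safeguard against several of these risks is to screen prompts for obvious identifiers before they leave your environment. The sketch below is a minimal regex-based redaction pass, not a substitute for a full DLP solution; the patterns and placeholder labels are illustrative assumptions.

```python
import re

# Minimal illustrative redaction pass; real deployments should use a DLP tool
# with broader pattern coverage and context-aware detection.
REDACTION_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def redact(prompt: str) -> str:
    """Replace likely identifiers with placeholder tokens before sending a prompt."""
    for label, pattern in REDACTION_PATTERNS.items():
        prompt = pattern.sub(f"[{label} REDACTED]", prompt)
    return prompt


print(redact("Contact Jane at jane.doe@example.com or 555-867-5309, SSN 123-45-6789."))
# -> "Contact Jane at [EMAIL REDACTED] or [PHONE REDACTED], SSN [SSN REDACTED]."
```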

Risk Vectors by AI Deployment Type

Different AI deployment models carry varying levels of risk:

  • Public AI Services: Highest risk - data leaves your environment and usage terms typically grant broad rights to provider
  • Private Cloud Instances: Medium risk - data handled according to specific agreement but still leaves environment
  • On-Premises Models: Lower risk - data remains in your environment but model quality may be lower

AI Data Collection Assessment Methodology

A structured approach to assess and manage the risks associated with AI data collection and usage.

Data Inventory

Catalog what types of data your organization is sharing with AI services

Service Evaluation

Assess the data handling practices of each AI service provider

Risk Classification

Categorize data based on sensitivity and potential impact if exposed

Policy Development

Create governance structures for acceptable AI use

Value Assessment

Calculate the economic value of data being shared with AI services

Controls Implementation

Deploy technical and procedural safeguards to mitigate risks
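Once the inventory, service evaluation, and risk classification steps exist, they can feed a simple scoring pass that highlights which combinations of data sensitivity and deployment type warrant the strictest controls. The tiers, weights, and threshold below are illustrative assumptions, not a formal Klavan Security scoring standard.

```python
# Illustrative risk scoring combining data sensitivity and deployment type.
# Tiers, weights, and the threshold are assumptions for demonstration only.
SENSITIVITY_SCORES = {"public": 1, "internal": 2, "regulated": 4, "strategic": 5}
DEPLOYMENT_SCORES  = {"public_service": 3, "private_cloud": 2, "on_premises": 1}


def risk_score(data_class: str, deployment: str) -> int:
    """Higher scores indicate combinations that warrant stricter controls."""
    return SENSITIVITY_SCORES[data_class] * DEPLOYMENT_SCORES[deployment]


inventory = [
    ("regulated", "public_service"),   # e.g. PHI pasted into a public chatbot
    ("internal",  "private_cloud"),
    ("strategic", "on_premises"),
]

for data_class, deployment in inventory:
    score = risk_score(data_class, deployment)
    action = "block or redact" if score >= 8 else "allow with policy controls"
    print(f"{data_class:>10} via {deployment:<15} score={score:<2} -> {action}")
```

Multiplicative scoring is deliberately crude; the point is to surface the worst combinations, such as regulated or strategic data flowing to public services, for immediate policy attention.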

Assessment Checklist

  • Identify all AI services in use across the organization
  • Review terms of service and data processing agreements
  • Classify data being shared with each service
  • Evaluate compliance implications (GDPR, HIPAA, etc.)
  • Calculate potential economic impact of data sharing
  • Develop appropriate use policies and controls
  • Implement training and awareness programs
  • Consider more secure alternatives where appropriate

Sources: Methodology based on NIST SP 800-53, ISO/IEC 27005, and NIST Risk Management Framework adapted for AI-specific concerns.

REQUEST YOUR AI DATA COLLECTION ASSESSMENT

Our team specializes in evaluating AI services, data collection practices, and implementing appropriate safeguards for your organization.

All assessments and services are priced in USD.