Langfuse vs Arize Phoenix: Self-Hosted LLM Observability Compared

Langfuse and Arize Phoenix solve LLM observability from opposite ends. Langfuse is a production-first tracing and prompt management platform: self-hosted, it runs on Postgres, ClickHouse, Redis, and S3-compatible storage, and it ships under the MIT license. Phoenix is a developer-first evaluation toolkit you install with one command, pip install arize-phoenix, and it ships under the Elastic License 2.0 (ELv2), which is not an OSI-approved open-source license and blocks offering Phoenix as a hosted or managed service to third parties. If you are choosing between them for a self-hosted stack, the decision comes down to one question: do you need production monitoring and prompt governance, or fast local evaluation and experimentation?

Key takeaways:

  • Langfuse core code is MIT licensed; Arize Phoenix is ELv2, a source-available license, not an OSI-approved open-source one.
  • Langfuse's self-hosted stack needs four services (Postgres, ClickHouse, Redis or Valkey, blob storage); Phoenix runs as a single process with SQLite by default, or Postgres 14+ for production.
  • Langfuse leads on tracing scale, multi-agent observability, and versioned prompt management. Phoenix leads on evaluation depth, RAG-specific tracing, and one-line auto-instrumentation.
  • Many mature teams run both: Phoenix for experiment and eval iteration, Langfuse for production monitoring after ship.

License and self-hosting rights

The license question is the one most comparison posts get wrong, and it is the one that actually matters if you plan to self-host commercially. Langfuse's core repository is licensed under MIT (specifically MIT Expat), with only the ee/ enterprise-feature directories carrying a separate commercial license (Source: Langfuse GitHub). That means the tracing, prompt management, and evaluation code most self-hosters run is genuinely permissive open source.

Arize Phoenix ships under the Elastic License 2.0. ELv2 is a source-available license, not an OSI-approved open-source license: you can read, modify, and self-host the code, but the license explicitly forbids offering Phoenix "as a hosted or managed service" to third parties (Source: Phoenix GitHub LICENSE). For a company running Phoenix internally, that restriction rarely bites. For anyone building a hosted product on top of Phoenix's code, it is disqualifying.

Operator note (first-hand): I pulled both LICENSE files directly from GitHub rather than trusting secondary write-ups. github.com/langfuse/langfuse/LICENSE opens with the MIT Expat grant; github.com/Arize-ai/phoenix/LICENSE opens with "Elastic License 2.0 (ELv2). Acceptance: By using the software, you agree to all of the terms and conditions below." Several existing comparison posts label Langfuse as Apache-2.0. That's incorrect as of the current repository state; the accurate label is MIT.

Architecture: what you actually run

Langfuse's self-hosted stack is not a single container. The docs list four required components: Postgres for transactional data, ClickHouse as the high-throughput OLAP store for traces, observations, and scores, Redis or Valkey for queueing and caching, and S3-compatible object storage for events and multi-modal inputs (Source: Langfuse self-hosting docs). On top of that you run two application containers, langfuse-web and langfuse-worker. Docker Compose covers testing; production deployments use Kubernetes with Helm, or Terraform templates for AWS, Azure, and GCP.
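The component list above can be turned into a quick pre-flight checklist. The sketch below is illustrative only: the dictionary keys and the `missing_pieces` helper are hypothetical labels invented for this example, not Langfuse configuration names or part of any Langfuse tooling.

```python
# Hypothetical pre-flight checklist for a self-hosted Langfuse deployment.
# The keys are illustrative labels, NOT real Langfuse configuration names.
REQUIRED_COMPONENTS = {
    "postgres": "transactional data",
    "clickhouse": "traces, observations, and scores (OLAP)",
    "redis_or_valkey": "queueing and caching",
    "blob_storage": "events and multi-modal inputs (S3-compatible)",
}

# The two application containers named in the self-hosting docs.
REQUIRED_CONTAINERS = ("langfuse-web", "langfuse-worker")


def missing_pieces(provisioned: set[str]) -> list[str]:
    """Return required components and containers absent from `provisioned`, sorted by name."""
    required = set(REQUIRED_COMPONENTS) | set(REQUIRED_CONTAINERS)
    return sorted(required - provisioned)


# Example: a team that already runs Postgres and Redis and has deployed the web container.
already_have = {"postgres", "redis_or_valkey", "langfuse-web"}
for name in missing_pieces(already_have):
    print(f"still needed: {name}")
```

A check like this makes the gap explicit before any Helm chart or Terraform template is applied: for most teams the missing entries are ClickHouse and object storage.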

Phoenix inverts that complexity. The default install is pip install arize-phoenix, and with no configuration it runs as a single local process backed by SQLite, writable to disk via PHOENIX_WORKING_DIR (Source: Arize Phoenix docs). Point the PHOENIX_SQL_DATABASE_URL environment variable at a Postgres 14+ instance and the same server process becomes production-durable, with Helm charts available for Kubernetes deployments. There is no ClickHouse, no separate worker process, no message queue to operate.
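To make the storage switch concrete, here is a minimal sketch of the selection logic as described above. The `describe_storage` helper is hypothetical and written for this article; it is not Phoenix's own code, and it only reads the two environment variable names mentioned in the text.

```python
def describe_storage(env: dict[str, str]) -> str:
    """Summarize which storage backend the given environment would select.

    Mirrors the behavior described in the text; Phoenix's actual resolution
    logic may differ in details (an assumption of this sketch).
    """
    url = env.get("PHOENIX_SQL_DATABASE_URL", "").strip()
    if url.startswith("postgresql"):
        # Deliberately avoid echoing the URL: it may contain credentials.
        return "postgres (external, durable)"
    if url:
        return "custom database URL"
    working_dir = env.get("PHOENIX_WORKING_DIR", "").strip()
    if working_dir:
        return f"sqlite under {working_dir}"
    return "sqlite in the default working directory"


# Pass os.environ in real use; plain dicts keep the example self-contained.
print(describe_storage({}))
print(describe_storage({"PHOENIX_WORKING_DIR": "/var/lib/phoenix"}))
print(describe_storage({"PHOENIX_SQL_DATABASE_URL": "postgresql://user:secret@db:5432/phoenix"}))
```

The point of the sketch is that moving from a laptop to a durable deployment is a one-variable change rather than a new set of services.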

Operator note (first-hand): running pip install arize-phoenix installs a working local server in under a minute; there is no docker-compose file to write first. Standing up Langfuse's four-service stack, by contrast, means provisioning a database you did not need to think about before (ClickHouse) alongside the Postgres instance most teams already run. If your infrastructure team already operates ClickHouse for other analytics, Langfuse's footprint is a rounding error. If it does not, that is the real cost of choosing Langfuse over Phoenix.

Tracing vs evaluation focus

Both tools trace LLM calls over OpenTelemetry, but they optimize for different jobs. Langfuse is built around production observability: cost and latency tracking per user or session, an Agent Graph view for multi-agent systems, and alerting hooks meant to run continuously against live traffic. Phoenix is built around OpenInference, its own semantic-convention layer on top of OpenTelemetry, and it auto-instruments popular frameworks with a single line of code, prioritizing fast iteration during development over always-on production monitoring (Source: Arize Phoenix docs).
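As a rough illustration of what "cost and latency tracking per session" means in practice, the sketch below aggregates span records by session. The `Span` fields and the `summarize_by_session` function are assumptions made for this example; neither tool exposes this exact data shape, and both compute these rollups for you in their UIs.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified record of one traced LLM call."""
    session_id: str
    latency_ms: float
    cost_usd: float


def summarize_by_session(spans: list[Span]) -> dict[str, dict[str, float]]:
    """Roll spans up into call count, total cost, and mean latency per session."""
    totals = defaultdict(lambda: {"calls": 0, "cost_usd": 0.0, "latency_ms": 0.0})
    for span in spans:
        bucket = totals[span.session_id]
        bucket["calls"] += 1
        bucket["cost_usd"] += span.cost_usd
        bucket["latency_ms"] += span.latency_ms
    return {
        session_id: {
            "calls": bucket["calls"],
            "cost_usd": bucket["cost_usd"],
            "mean_latency_ms": bucket["latency_ms"] / bucket["calls"],
        }
        for session_id, bucket in totals.items()
    }


spans = [
    Span("session-a", latency_ms=100.0, cost_usd=0.25),
    Span("session-a", latency_ms=300.0, cost_usd=0.5),
    Span("session-b", latency_ms=50.0, cost_usd=0.125),
]
for session_id, stats in summarize_by_session(spans).items():
    print(session_id, stats)
```

In production this rollup runs continuously over live traffic, which is why Langfuse pairs it with an OLAP store rather than a single-file database.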

Phoenix's evaluation surface goes deeper for RAG-specific work: retrieval relevance scoring, hallucination detection, and LLM-as-judge evals are first-class citizens in its UI, not bolted-on features. Langfuse has evaluation tooling too, but it is secondary to tracing; teams doing heavy eval iteration during model or prompt selection tend to reach for Phoenix first, then move to Langfuse once the system ships.
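Retrieval relevance scoring usually reduces to a per-document relevant/not-relevant label (from an LLM judge or a human) followed by a ranking metric. The sketch below shows one common metric, precision at k, computed from such labels. It is a generic illustration, not Phoenix's evaluator API, and the `precision_at_k` name and label format are assumptions of this example.

```python
def precision_at_k(relevance_labels: list[bool], k: int) -> float:
    """Fraction of the top-k retrieved documents judged relevant.

    `relevance_labels` is ordered by retrieval rank (best first). Slots beyond
    the end of the list count as misses, so short result lists are penalized.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return sum(relevance_labels[:k]) / k


# Judge labels for four retrieved chunks, in rank order (hypothetical data).
labels = [True, False, True, True]
print(f"precision@3 = {precision_at_k(labels, 3):.3f}")
```

Tracking a number like this across prompt or retriever changes is the kind of iteration loop the evaluation-first workflow is built around.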

Prompt management

Prompt management is where Langfuse pulls ahead concretely. Prompts are treated as versioned assets with labeling, release channels, and a runtime fetch API. Langfuse also ships a Prompt Playground for testing prompt changes against real traces, though some collaborative features sit behind the paid tiers (Source: Langfuse FAQ). Phoenix ships a Prompt Playground too, and it is available in the open-source distribution without a paywall, but prompt versioning is not a core organizing concept the way it is in Langfuse. If prompt governance across a team is the job, Langfuse's model fits it more directly.
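The idea of prompts as versioned assets with labels and a runtime fetch can be sketched with a toy in-memory registry. Everything here (`PromptRegistry`, `publish`, `set_label`, `fetch`, and the label names) is hypothetical and exists only to illustrate the model; it is not the Langfuse SDK or its API.

```python
class PromptRegistry:
    """Toy in-memory model of versioned prompts with movable labels."""

    def __init__(self) -> None:
        self._versions: dict[str, list[str]] = {}
        self._labels: dict[tuple[str, str], int] = {}

    def publish(self, name: str, template: str) -> int:
        """Store a new immutable version and return its 1-based version number."""
        versions = self._versions.setdefault(name, [])
        versions.append(template)
        return len(versions)

    def set_label(self, name: str, label: str, version: int) -> None:
        """Point a label (a release channel) at an existing version."""
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise KeyError(f"{name!r} has no version {version}")
        self._labels[(name, label)] = version

    def fetch(self, name: str, label: str = "production") -> str:
        """Resolve a label to its template, as an application would at runtime."""
        version = self._labels[(name, label)]
        return self._versions[name][version - 1]


registry = PromptRegistry()
v1 = registry.publish("greeting", "Hello {name}")
v2 = registry.publish("greeting", "Hi {name}, welcome back!")
registry.set_label("greeting", "production", v1)
registry.set_label("greeting", "staging", v2)
print(registry.fetch("greeting"))             # resolves the production label
print(registry.fetch("greeting", "staging"))  # resolves the staging label
```

Because application code asks for a label rather than a version number, promoting or rolling back a prompt is a label move, not a redeploy. That separation is what makes prompt governance across a team tractable.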

At-a-glance comparison

| Criterion | Langfuse | Arize Phoenix |
| --- | --- | --- |
| License | MIT (core); separate license for ee/ dirs | Elastic License 2.0 (ELv2), not OSI-approved |
| Self-host backend | Postgres + ClickHouse + Redis/Valkey + S3-compatible storage | SQLite by default; Postgres 14+ for production |
| Install | Docker Compose (dev) or Kubernetes/Terraform (prod) | pip install arize-phoenix, single process |
| Core focus | Production tracing, cost tracking, multi-agent observability | Evaluation, RAG tracing, experimentation |
| Prompt management | Versioned assets, playground, release channels | Playground included; not a core primitive |
| Instrumentation | OpenTelemetry, framework callbacks | OpenInference (built on OpenTelemetry), auto-instrumentation |
| Free tier (managed cloud) | 50,000 observations/month | 25,000 spans + 1GB data/month |
| GitHub stars | 30.3k (github.com/langfuse/langfuse) | 10.4k (github.com/Arize-ai/phoenix) |

(Source: Langfuse self-hosting docs, Phoenix GitHub, Langfuse FAQ)

Pricing model

Both projects are free to self-host without a paid tier, but their managed clouds bill on different units. Langfuse's usage-based pricing counts traces, observations, and scores against a 50,000-unit monthly free allowance before paid plans start (Source: Langfuse FAQ). Arize's managed AX Cloud, which fronts Phoenix's evaluation engine, bills on spans plus data-ingestion volume in gigabytes, with a free tier of 25,000 spans and 1GB before a $50-per-month Pro tier applies. Neither pricing model changes what you can do self-hosted; they only matter if you outgrow self-hosting and move to the vendor's cloud.
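The difference in billing units is easier to see with the arithmetic written out. The sketch below uses only the free-tier figures quoted above; the function names are hypothetical, and treating the limits as inclusive is an assumption of this example rather than something either vendor's pricing page is quoted as saying here.

```python
# Free-tier limits as quoted in this article (per month).
LANGFUSE_FREE_UNITS = 50_000
ARIZE_FREE_SPANS = 25_000
ARIZE_FREE_GB = 1.0


def langfuse_units(traces: int, observations: int, scores: int) -> int:
    """Langfuse counts traces, observations, and scores against one allowance."""
    return traces + observations + scores


def within_langfuse_free_tier(traces: int, observations: int, scores: int) -> bool:
    # Assumption: the limit is inclusive.
    return langfuse_units(traces, observations, scores) <= LANGFUSE_FREE_UNITS


def within_arize_free_tier(spans: int, ingested_gb: float) -> bool:
    """Arize bills on two axes; exceeding either one leaves the free tier."""
    return spans <= ARIZE_FREE_SPANS and ingested_gb <= ARIZE_FREE_GB


# Hypothetical monthly workload.
print(langfuse_units(traces=10_000, observations=35_000, scores=4_000))
print(within_langfuse_free_tier(10_000, 35_000, 4_000))
print(within_arize_free_tier(spans=20_000, ingested_gb=1.5))
```

Note the two-axis check on the Arize side: a workload with few spans but large payloads (long contexts, multi-modal inputs) can leave the free tier on data volume alone.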

Which should you pick

Decision rule: if the job is monitoring an LLM application already in production, with cost tracking, multi-agent tracing, and prompt governance across a team, Langfuse's stack earns its extra operational weight. If the job is iterating on evaluations, RAG quality, or prompt experiments before something ships, Phoenix's single-process install gets you running in minutes with deeper eval tooling out of the box.

The two are not mutually exclusive. Teams that have matured past initial experimentation increasingly run Phoenix during development for fast eval loops, then switch traces over to Langfuse once the system is live and the questions shift from "does this work" to "is this still working at 2am." Choosing one does not mean permanently ruling out the other; it means picking which stage of the lifecycle you are optimizing for today.

Mohit Saxena, a software engineer who has covered both platforms in depth, frames the split this way: "The choice between Langfuse and Phoenix reduces to a single question: do you need production monitoring and prompt governance, or evaluation and experiment comparison?" (Source: MyEngineeringPath). That framing holds up against the primary-source evidence here: it is a lifecycle question, not a quality one.

FAQ

Is Arize Phoenix open source?

Phoenix's code is publicly available on GitHub, but it ships under the Elastic License 2.0, which is source-available rather than OSI-approved open source. You can read, modify, and self-host it, but ELv2 forbids offering it to third parties as a hosted or managed service.

What license does Langfuse use?

Langfuse's core repository uses the MIT license (MIT Expat). Only the enterprise-feature directories under ee/ in the source tree carry a separate commercial license; the tracing and prompt management code most self-hosters run is fully MIT.

Which is better for production, Langfuse or Arize Phoenix?

Langfuse is generally the stronger fit for production monitoring: it is built for continuous tracing, cost tracking, and multi-agent observability at scale. Phoenix can run in production too, especially pointed at Postgres, but its strengths lean toward evaluation and experimentation rather than always-on monitoring.

Can you self-host Arize Phoenix?

Yes. Phoenix defaults to a single process backed by SQLite for local use, and setting PHOENIX_SQL_DATABASE_URL to a Postgres 14+ instance makes it production-durable. Helm charts are available for Kubernetes deployments, and even there the operational footprint stays far smaller than Langfuse's four-service stack.

What are people saying about Langfuse vs Arize Phoenix on Reddit?

Threads on r/AI_Agents and similar communities generally echo the split covered here: Langfuse gets picked for production tracing and prompt governance, while Phoenix gets picked for fast local evaluation, especially in RAG-heavy pipelines where its LLM-as-judge tooling is considered stronger out of the box.

References