Building a bilingual risk scoring system: what data actually matters when US and Russian markets give conflicting signals?

I’ve been trying to design a unified risk scoring framework for influencer vetting that works across US and Russian markets, and I’m running into a fundamental problem: the signals that indicate risk in one market don’t always translate to the other.

Here’s an example that’s been eating at me: in the US market, we use follower-to-engagement ratio and audience growth velocity as primary risk indicators. Influencers with sudden spikes or weird engagement patterns get flagged. Makes sense.

But in the Russian market, the same patterns can be completely normal due to algorithmic differences on VK and Rutube, or just how that audience behaves. So when I build a unified model, it either ends up tuned for the US (and flags too many legitimate Russian creators) or tuned for Russia (and misses actual issues in the US).

I’ve talked to a few people using bilingual strategies, and they mention using “market-weighted thresholds” or “region-specific training sets.” But I’m not clear on the practical execution:

  • Are you building completely separate models for each market, or are you using one model with market coefficients?
  • How do you weight data when your Russian dataset is smaller or of different quality than your US data?
  • When you have conflicting signals (US model says high risk, Russia says low risk), how do you decide what to actually do?

I want a system that’s responsive and protects our brand, but doesn’t reject creators just because they don’t fit a US-centric benchmark. What’s your approach?

We went through this exact problem. Our first attempt was one unified model, and it was a disaster for the reasons you described.

Here’s what actually works: we built separate models, but they share infrastructure. Think of it like A/B testing—Model US and Model RU run in parallel, and we combine their outputs.

For an influencer, we get two scores: risk_us and risk_ru. Then we apply decision logic (a sketch follows the list):

  • If both are low risk: clear to go
  • If both are high risk: reject
  • If conflicting: escalate to human review with context about why the models disagree
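
For concreteness, here is a minimal sketch of that routing in Python; the thresholds and function name are illustrative assumptions, not the poster's actual values.

```python
# Minimal sketch of the parallel-model routing described above. Thresholds
# and names are illustrative assumptions.

LOW, HIGH = 0.30, 0.70  # hypothetical cut-offs on a 0-1 risk scale

def route(risk_us: float, risk_ru: float) -> str:
    """Turn the two market-specific scores into a single action."""
    if risk_us <= LOW and risk_ru <= LOW:
        return "approve"    # both models agree: low risk
    if risk_us >= HIGH and risk_ru >= HIGH:
        return "reject"     # both models agree: high risk
    return "escalate"       # conflicting or borderline: human review

print(route(risk_us=0.92, risk_ru=0.25))  # -> escalate
```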

The key is that the human review is informed by the models’ reasoning, not just the scores. We can see why US flagged someone (maybe rapid follower growth) and why Russia cleared them (normal growth pattern for that platform). That context lets us make a real decision.

On dataset quality: we weighted Russian data lower initially because we had less of it, but after 6 months we had enough historical campaign data to give both markets equal weight. The more campaign history you have, the better the model performs.

One practical tip: document every decision where the models disagreed and you had to choose. After 50 of those, you’ll have patterns. Maybe conflicting signals in certain creator tiers are usually false alarms. Maybe certain signal combinations matter more than others. Use that to refine your decision logic.
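
If it helps, that disagreement log can be as simple as an append-only CSV; the fields below are just one plausible set, not what the commenter actually records.

```python
# Append one row per case where the two models disagreed and a human decided.
# Field names and the CSV format are illustrative assumptions.
import csv
from datetime import date

def log_disagreement(path, creator_id, tier, risk_us, risk_ru, reasons, final_call):
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([
            date.today().isoformat(), creator_id, tier,
            risk_us, risk_ru, "|".join(reasons), final_call,
        ])

log_disagreement("disagreements.csv", "creator_123", "mid-tier",
                 0.88, 0.21, ["rapid_follower_growth"], "approved")
```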

I think it’s cool that you’re trying to be fair across markets. From a creator’s perspective, it’s frustrating when an algorithm doesn’t understand that growth looks different depending on where your audience is.

One thing I’d suggest: test your risk model with creators who are actually in both markets—bilingual creators or those with distributed audiences. Their growth patterns might be weird to a single-market model but totally legitimate. They might reveal blind spots in your system.

This is actually a more sophisticated problem than most people think. You’re essentially dealing with multi-domain adaptation in machine learning.

Our approach: we built a base model trained on combined data, but with explicit “market bias” variables. So the model learns both universal fraud signals and market-specific normal behavior. It’s more complex to set up, but it captures nuance better.
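
As a rough sketch of that idea, assuming scikit-learn, a pandas DataFrame, and made-up column names (follower_growth, engagement_ratio, market, is_fraud):

```python
# One model over combined data, with the market as an explicit feature so the
# learner can pick up market-specific baselines on top of universal signals.
# Feature columns and the gradient-boosting choice are assumptions.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def train_combined(df: pd.DataFrame) -> GradientBoostingClassifier:
    X = pd.get_dummies(
        df[["follower_growth", "engagement_ratio", "market"]],
        columns=["market"],  # becomes indicator columns, e.g. market_US, market_RU
    )
    y = df["is_fraud"]
    return GradientBoostingClassifier().fit(X, y)
```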

Regarding conflicting signals: we don't just average them. We weight by confidence. If the US model is highly confident (0.95 probability of fraud) but the Russian model is uncertain (0.55), we trust the US model more. Confidence metadata matters as much as the score itself.
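
A toy version of that weighting, using distance from 0.5 as a stand-in for confidence (my assumption; the commenter may use calibrated or model-reported uncertainty instead):

```python
# Confidence-weighted combination instead of a plain average.
def combine(risk_us: float, risk_ru: float) -> float:
    conf_us = abs(risk_us - 0.5) * 2  # 0 = fully uncertain, 1 = fully confident
    conf_ru = abs(risk_ru - 0.5) * 2
    total = conf_us + conf_ru
    if total == 0:
        return 0.5  # both models are on the fence
    return (risk_us * conf_us + risk_ru * conf_ru) / total

print(combine(0.95, 0.55))  # ~0.91: the confident US signal dominates
```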

On dataset quality: you're right to be concerned. If your Russian dataset is smaller, use a Bayesian approach that accounts for that uncertainty. Essentially: weight each market's samples inversely to the width of its confidence interval. This prevents low-data markets from introducing noise.
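
One simple reading of that, sketched below: shrink a market's weight toward zero when its sample count (and therefore its confidence) is low. The prior_strength constant and the sample counts are arbitrary assumptions, not the commenter's actual numbers.

```python
# Down-weight markets whose estimates carry wide uncertainty (few samples).
def market_weight(n_samples: int, prior_strength: int = 500) -> float:
    """Approaches 1.0 as the market's data dwarfs the prior; small data -> small weight."""
    return n_samples / (n_samples + prior_strength)

print(market_weight(20_000))  # larger US-style dataset -> ~0.98
print(market_weight(1_200))   # smaller RU-style dataset -> ~0.71
```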

Last thing: build a holdout test set from both markets before you deploy. Run your combined model and measure precision/recall separately per market. If US precision is 92% but Russia is 68%, you need more tuning before launch. A 24-point gap means your model isn’t truly bilingual yet—it’s just two models butted against each other.
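
A sketch of that per-market check, assuming scikit-learn and held-out arrays of labels, predictions, and market tags:

```python
# Report precision/recall separately per market on a holdout set.
from sklearn.metrics import precision_score, recall_score

def per_market_report(y_true, y_pred, markets):
    for m in sorted(set(markets)):
        idx = [i for i, mk in enumerate(markets) if mk == m]
        yt = [y_true[i] for i in idx]
        yp = [y_pred[i] for i in idx]
        print(f"{m}: precision={precision_score(yt, yp):.2f}, "
              f"recall={recall_score(yt, yp):.2f}")

per_market_report([1, 0, 1, 0, 1, 1], [1, 0, 0, 0, 1, 1],
                  ["US", "US", "RU", "RU", "RU", "US"])
```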

This is worth getting right because bad decisions scale fast.