Top LinkedIn Content on Machine Learning Algorithms

AI/ML Research Scientist

2,484 followers 7mo

We taught LSTMs to run in parallel. Now they've grown to 7B parameters, and are ready to challenge Transformers.
For years, we’ve assumed RNNs were doomed (inherently sequential, too slow to train, impossible to scale) and looked at Transformers as the go-to choice for Large Language Modelling. Turns out we just needed better math.
Introducing 𝗣𝗮𝗿𝗮𝗥𝗡𝗡: 𝗨𝗻𝗹𝗼𝗰𝗸𝗶𝗻𝗴 𝗣𝗮𝗿𝗮𝗹𝗹𝗲𝗹 𝗧𝗿𝗮𝗶𝗻𝗶𝗻𝗴 𝗼𝗳 𝗡𝗼𝗻𝗹𝗶𝗻𝗲𝗮𝗿 𝗥𝗡𝗡𝘀 𝗳𝗼𝗿 𝗟𝗟𝗠𝘀
👉 [TL;DR] We can now train nonlinear RNNs at unprecedented scales, by parallelising what was previously considered inherently sequential: the unrolling of recurrent computations. If you care about fast inference for LLMs, or are into time-series analysis, we've got good news for you: RNNs are back on the menu.
🐍 But wait, doesn’t Mamba parallelise this too? Sure, but here's the catch: Mamba requires state space updates to be linear, fundamentally affecting expressivity. We want the freedom to apply nonlinearities sequence-wise.
💡 Our approach: Recast the sequence of nonlinear recurrences as a system of equations, then solve them in parallel using Newton's method. As a bonus, make everything blazingly fast with custom CUDA kernels.
⚡ The result? Up to 665x speedup over naive sequential processing, and training times comparable to Mamba, even with the extra overhead from Newton’s iterations.
📈 So we took LSTM and GRU architectures (remember those from the pre-Transformer era?), scaled them to 7B parameters, and achieved perplexity comparable to similarly-sized Transformers. No architectural tricks. Just pure scale, finally unlocked.
🔥 Why this matters: Mamba challenged the Transformer’s monopoly. ParaRNN expands the search space of available architectures. It’s time to get back to the drawing board and use these tools to start designing the next generation of inference-efficient models.
💻 To aid with this, we’re releasing open-source code to parallelise RNN applications, out of the box. No need to bother implementing your own parallel scan, nor trying to remember how Newton works: just prescribe the recurrence relationship, flag eventual structures in your hidden state update, and watch GPUs go 𝘣𝘳𝘳𝘳𝘳𝘳𝘳𝘳.
Paper: https://lnkd.in/dTEGh5Jp
Code: https://lnkd.in/d_Ven9Y2
Collaborators: Pau Rodriguez Lopez, Miguel Sarabia, Xavier Suau, Luca Zappella
---------------------------------------------
💼 And if you're a PhD student interested in working on these topics, we've got a fresh internship position just for you: https://lnkd.in/dDVSsfJj
𝗧𝗶𝗺𝗲 𝘁𝗼 𝗲𝘅𝗽𝗹𝗼𝗿𝗲 𝘄𝗵𝗮𝘁 𝘁𝗿𝘂𝗹𝘆 𝗻𝗼𝗻𝗹𝗶𝗻𝗲𝗮𝗿 𝗥𝗡𝗡𝘀 𝗰𝗮𝗻 𝗱𝗼 𝗮𝘁 𝘀𝗰𝗮𝗹𝗲
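To make the "system of equations plus Newton" idea in the post above concrete, here is a minimal sketch. It assumes a scalar tanh RNN, h_t = tanh(w * h_{t-1} + x_t), rather than the LSTM and GRU cells the post mentions, and every name in it (`sequential_rnn`, `newton_rnn`, `inclusive_scan`) is hypothetical. It is not the released ParaRNN code and has nothing to do with its CUDA kernels. Each Newton step reduces to a linear recurrence, which an associative scan can evaluate in a logarithmic number of rounds.

```python
import math


def sequential_rnn(xs, w=0.9, h0=0.0):
    """Reference: unroll h_t = tanh(w * h_{t-1} + x_t) one step at a time."""
    h, out = h0, []
    for x in xs:
        h = math.tanh(w * h + x)
        out.append(h)
    return out


def combine(first, second):
    """Compose two affine maps d -> a*d + b, applying `first` before `second`."""
    a1, b1 = first
    a2, b2 = second
    return (a2 * a1, a2 * b1 + b2)


def inclusive_scan(pairs):
    """Hillis-Steele scan: about log2(T) rounds; updates within a round are independent."""
    out = list(pairs)
    step = 1
    while step < len(out):
        out = [out[i] if i < step else combine(out[i - step], out[i])
               for i in range(len(out))]
        step *= 2
    return out


def newton_rnn(xs, w=0.9, h0=0.0, iters=20, tol=1e-12):
    """Solve all T equations h_t - tanh(w * h_{t-1} + x_t) = 0 jointly with Newton."""
    if not xs:
        return []
    h = [0.0] * len(xs)  # initial guess for the whole trajectory
    for _ in range(iters):
        prev = [h0] + h[:-1]
        f = [math.tanh(w * p + x) for p, x in zip(prev, xs)]
        residual = [fi - hi for fi, hi in zip(f, h)]
        if max(abs(r) for r in residual) < tol:
            break
        # Derivative of tanh(w * h_{t-1} + x_t) with respect to h_{t-1}.
        jac = [w * (1.0 - fi * fi) for fi in f]
        # Newton update obeys delta_t = jac_t * delta_{t-1} + residual_t, delta_0 = 0.
        scanned = inclusive_scan(list(zip(jac, residual)))
        delta = [b for _, b in scanned]
        h = [hi + di for hi, di in zip(h, delta)]
    return h
```

In this pure-Python sketch the scan still runs on one core, so it shows the structure and not the speedup. Because the first k states are exact after k Newton steps, the loop reproduces the sequential result once `iters` reaches the sequence length, and usually converges much earlier.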

56 Comments

David Klindt

Assistant Professor in Neuroscience and Artificial Intelligence Research

2,712 followers 2w

📄 What does a JEPA actually learn? We can finally prove it 🌍 A World Model is an AI's internal map of how the world works. The dream is that it learns the real structure of the world, not some scrambled version of it. In our field that property has a name: identifiability. So excited to share our new theory of identifiable World Models. The setup: the world has hidden latent variables (think the pose of an object, or the angle of a joint). We never see them directly, only data that mixes them up in complicated nonlinear ways. The question is whether a model can undo that mixing and recover the true latents. We prove that LeJEPA does. Train an encoder to align related views while keeping its latent space Gaussian (that's SIGReg), and it provably recovers the world's latent variables, up to a simple rotation. An undistorted map. Why does that matter? An undistorted map is exactly what you need to plan. We prove that a plan made in the learned World Model is identical to the plan in the real world: same actions, same outcome. We then show it on a robot (Reacher) controlled straight from pixels. The biggest surprise was the Gaussian. Coming from independent component analysis (ICA), a Gaussian latent looks like the worst possible choice, since classic ICA can only separate sources when they are non-Gaussian. Based on that, I'd have bet against it. But we prove the converse: in our setting the Gaussian is the unique distribution that gives clean linear recovery. Every other distribution is still recoverable, but only up to a nonlinear distortion 🤯 Two more things I'm proud of. We prove an approximate bound for the realistic case where training only gets close to the optimum (the kind of empirically grounded identifiability theory we've argued the field needs). And we machine-checked all five theorems in Lean 4, zero sorry, with the standard textbook lemmas axiomatized 🤓 I think identifiability is the right definition of what it means to learn a World Model. 
The natural next step for the theory is to add action conditioning, like in LeWorldModel. Huge thanks to my collaborators Randall Balestriero and Yann LeCun for their guidance and support 🔬🤖 ➡️ Paper: https://lnkd.in/e_ytBAbs ➡️ Site: https://lnkd.in/e4TcyBYD

26 Comments

Lior Alexander

Helping devs stay up to date with AI. CEO at AlphaSignal.

209,722 followers 6mo

Karpathy is back. His new LLM-Council might be the future of how LLMs actually get used. Here’s how it works:
1. Your prompt fans out to multiple models
▸ GPT, Claude, Gemini, Grok, whatever you add.
▸ Each model answers the same query independently. No shared context. No coordination. Pure first-pass reasoning.
2. Then the models see each other’s answers
▸ All responses are revealed to every model, anonymized.
▸ No one knows who wrote what.
▸ This removes brand bias and forces actual evaluation.
3. Every model becomes a critic
Each model:
→ ranks the answers
→ flags mistakes
→ explains weaknesses
→ highlights better reasoning
This gives you per-query evaluation instead of a static benchmark.
4. A Chairman model makes the final call
It gets:
▸ all answers
▸ all rankings
▸ all critiques
Then it produces one final response by merging the strongest reasoning and correcting the errors exposed by the council. This is routing based on evidence, not vibes.
5. You see one clean output
▸ The UI looks like ChatGPT.
▸ Under the hood: workers → critics → synthesis.
A simple, transparent LLM router that judges models on each task, instead of asking you to trust a single guess.
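The workers → critics → synthesis flow described above can be sketched in a few lines. This is an illustration under assumptions, not the LLM-Council code: `Model` is a hypothetical "prompt in, text out" callable, `run_council` is a made-up name, and the prompt wording is invented for the example.

```python
from typing import Callable, Dict

Model = Callable[[str], str]  # hypothetical interface: prompt in, text out


def run_council(prompt: str, members: Dict[str, Model], chairman: Model) -> str:
    # Stage 1: every member answers the same prompt independently.
    answers = {name: ask(prompt) for name, ask in members.items()}

    # Stage 2: hide authorship behind neutral labels (Response A, B, ...).
    ballot = "\n".join(
        f"Response {chr(ord('A') + i)}: {text}"
        for i, text in enumerate(answers.values())
    )

    # Stage 3: each member ranks and critiques the anonymized answers.
    review_prompt = (
        f"Question: {prompt}\n\n{ballot}\n\n"
        "Rank the responses and flag mistakes."
    )
    reviews = [ask(review_prompt) for ask in members.values()]

    # Stage 4: the chairman sees answers plus reviews and writes one reply.
    final_prompt = (
        f"Question: {prompt}\n\n{ballot}\n\nReviews:\n"
        + "\n".join(reviews)
        + "\n\nWrite one final answer."
    )
    return chairman(final_prompt)
```

In a real setup each callable would wrap an API client. The point of the sketch is that model names never reach the review or chairman prompts, only the neutral labels do.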

88 Comments

Aishwarya Srinivasan

635,304 followers 7mo

If you’re an AI engineer trying to understand how reasoning actually works inside LLMs, this will help you connect the dots.
Most large language models can generate. But reasoning models can decide. Traditional LLMs followed a straight line: Input → Predict → Output. No self-checking, no branching, no exploration. Reasoning models introduced structure, a way for models to explore multiple paths, score their own reasoning, and refine their answers.
We started with Chain-of-Thought (CoT) reasoning, then extended to Tree-of-Thought (ToT) for branching, and now to Graph-based reasoning, where models connect, merge, or revisit partial thoughts before concluding.
This evolution changes how LLMs solve problems. Instead of guessing the next token, they learn to search the reasoning space: exploring alternatives, evaluating confidence, and adapting dynamically.
Different reasoning topologies serve different goals:
• Chains for simple sequential reasoning
• Trees for exploring multiple hypotheses
• Graphs for revising and merging partial solutions
Modern architectures (like OpenAI’s o-series reasoning models, Anthropic’s Claude reasoning stack, DeepSeek R series and DeepMind’s AlphaReasoning experiments) use this idea under the hood. They don’t just generate answers, they navigate reasoning trajectories, using adaptive depth-first or breadth-first exploration, depending on task uncertainty.
Why does this matter?
• It reduces hallucinations by verifying intermediate steps
• It improves interpretability since we can visualize reasoning paths
• It boosts reliability for complex tasks like planning, coding, or tool orchestration
The next phase of LLM development won’t be about more parameters, it’ll be about better reasoning architectures: topologies that can branch, score, and self-correct.
I’ll be doing a deep dive on reasoning models soon on my Substack, exploring architectures, training approaches, and practical applications for engineers. If you haven’t subscribed yet, make sure you do: https://lnkd.in/dpBNr6Jg
♻️ Share this with your network
🔔 Follow along for more data science & AI insights
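The chain versus tree distinction in the post above can be illustrated with a toy search. This is only a sketch under assumptions: "thoughts" are tuples of numbers, the scorer is a hand-written heuristic instead of a model judging its own reasoning, and `tree_of_thought` is a hypothetical name that does not describe how any of the named systems work internally.

```python
def tree_of_thought(root, expand, score, is_goal, beam_width=2, max_depth=4):
    """Breadth-first search over partial 'thoughts', keeping the best few per level."""
    frontier = [root]
    for _ in range(max_depth):
        # Branch: every surviving thought proposes its continuations.
        candidates = [child for state in frontier for child in expand(state)]
        if not candidates:
            return None
        # Score and prune: keep only the `beam_width` most promising thoughts.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
        for state in frontier:
            if is_goal(state):
                return state
    return None


# Toy task: pick numbers from {3, 5, 7} whose sum is exactly 15.
def expand(state):
    return [state + (n,) for n in (3, 5, 7)]


def score(state):
    return -abs(15 - sum(state))  # closer to 15 looks more promising


def is_goal(state):
    return sum(state) == 15
```

With `beam_width=1` the search behaves like a single greedy chain and overshoots the target, returning `None`. With `beam_width=2` a second hypothesis survives long enough to reach a sum of 15, which is the kind of benefit the tree topology is meant to provide.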

55 Comments

Tomasz Tunguz

406,472 followers 3mo

I started by asking AI to do everything. Six months later, 65% of my agent’s workflow nodes run as non-AI code. The first version was fully agentic : every task went to an LLM. LLMs would confidently progress through tasks, though not always accurately. So I added tools to constrain what the LLM could call. Limited its ability to deviate. I added a Discovery tool to help the AI find those tools. Better, but not enough. Then I found Stripe’s minion architecture. Their insight : deterministic code handles the predictable ; LLMs tackle the ambiguous. I implemented blueprints, workflow charts written in code. Each blueprint specifies nodes, transitions between them, trigger conditions for matching tasks, & explicit error handling. This differs from skills or prompts. A skill tells the LLM what to do. A blueprint tells the system when to involve the LLM at all. Each blueprint is a directed graph of nodes. Nodes come in two types : deterministic (code) & agentic (LLM). Transitions between nodes can branch based on conditions. Deal pipeline updates, chat messages, & email routing account for 29% of workflows, all without a single LLM call. Company research, newsletter processing, & person research need the LLM for extraction & synthesis only. Another 36%. The workflow runs 67-91% as code. The LLM sees only what it needs : a chunk of text to summarize, a list to categorize, processed in one to three turns with constrained tools. Blog posts, document analysis, bug fixes are genuinely hybrid. 21% of workflows. Multiple LLM calls iterate toward quality. Only 14% remain fully agentic. Data transforms & error investigations. These tend to be coding tasks rather than evaluating a decision point in a workflow. The LLM needs freedom to explore. AI started doing everything. Now it handles routing, exceptions, research, planning, & coding. The rest runs without it. Is AI doing less? Yes. Is the system doing more? Also yes. 
The blueprints, the tools, the skills might be temporary scaffolding. With each new model release, capabilities expand. Tasks that required deterministic code six months ago might not tomorrow.
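A blueprint as described above (a directed graph of deterministic and agentic nodes with conditional transitions) might look roughly like this. It is a sketch under assumptions: `Node`, `run_blueprint`, `EMAIL_BLUEPRINT`, and the stubbed `fake_llm_summary` are hypothetical names, and the real system's trigger conditions and error handling are left out.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

State = Dict[str, Any]


@dataclass
class Node:
    kind: str                                    # "code" (deterministic) or "llm" (agentic)
    run: Callable[[State], State]                # does the work, returns the new state
    next_node: Callable[[State], Optional[str]]  # transition; None ends the run


def run_blueprint(nodes: Dict[str, Node], start: str, state: State) -> State:
    state = {**state, "llm_calls": 0}
    current: Optional[str] = start
    while current is not None:
        node = nodes[current]
        if node.kind == "llm":
            state["llm_calls"] += 1  # track how often the LLM is involved at all
        state = node.run(state)
        current = node.next_node(state)
    return state


def fake_llm_summary(state: State) -> State:
    # Stand-in for a real model call; it just truncates the body.
    return {**state, "summary": state["body"][:20]}


# Hypothetical email-routing blueprint: known senders never touch the LLM.
EMAIL_BLUEPRINT = {
    "classify": Node(
        "code",
        lambda s: {**s, "known": s["sender"].endswith("@example.com")},
        lambda s: "file" if s["known"] else "summarize",
    ),
    "summarize": Node("llm", fake_llm_summary, lambda s: "file"),
    "file": Node(
        "code",
        lambda s: {**s, "folder": "inbox" if s["known"] else "review"},
        lambda s: None,
    ),
}
```

Running `run_blueprint(EMAIL_BLUEPRINT, "classify", {...})` on a known sender finishes with zero LLM calls, while an unknown sender takes the branch through the agentic node exactly once.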

41 Comments

Andriy Burkov

PhD in AI, author of 📖 The Hundred-Page Language Models Book and 📖 The Hundred-Page Machine Learning Book

488,375 followers 3w

An absolute must read. LLMs cost a lot to run, so a common move is to train a small model to imitate a big one — feeding the small "student" the same inputs and having it match, word by word, the probabilities the large "teacher" assigns to each possible next word, a procedure called knowledge distillation. That matching is done on a fixed collection of example sentences, but a model writing text builds each sentence out of its own earlier words, so once the student makes an early choice that none of the training examples contained, it ends up in situations it was never shown, and small mistakes feed into later ones until the text degrades. In this ICLR 2024 paper from Google, Mila, and UoT, the authors instead have the student write sentences itself and use those sentences to choose the situations it gets tested on: at each point in a student-written sentence they take the words so far, ask the teacher what the distribution over the next word should be there, and push the student toward the teacher's answer — so the teacher supplies every target while the student's own writing decides where those targets get applied, which is exactly the off-track spots its writing tends to wander into. Tested on summarization, English-to-German translation, and grade-school math problems where the model writes out its reasoning before answering, this self-generated-data approach beats standard distillation recipes across a range of student sizes, and it slots into reinforcement-learning fine-tuning cleanly because both only need samples drawn from the student rather than gradients passed back through the sampling step. Read with an AI tutor and quizzes for better retention: https://lnkd.in/efguF7mr PDF: https://lnkd.in/edfTWfgt
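The on-policy idea in the post above (the student's own samples decide where the teacher's targets get applied) can be sketched with a toy tabular "student". Everything here is an assumption made for illustration: the three-token vocabulary, the hand-written `teacher_probs`, and the mixing update that stands in for a gradient step on a real network. It is not the paper's training code.

```python
import math
import random

VOCAB = ["a", "b", "<eos>"]
UNIFORM = [1.0 / 3] * 3


def teacher_probs(prefix):
    """Toy teacher: the next-token distribution depends on the last token."""
    if prefix and prefix[-1] == "a":
        return [0.1, 0.7, 0.2]
    return [0.6, 0.2, 0.2]


def kl(p, q):
    """KL(p || q) for two discrete distributions with q > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def sample_from_student(student, rng, max_len=4):
    """Let the student write a sequence; return every prefix it passed through."""
    prefix, visited = (), []
    while len(prefix) < max_len:
        visited.append(prefix)
        probs = student.get(prefix, UNIFORM)
        token = rng.choices(VOCAB, weights=probs)[0]
        if token == "<eos>":
            break
        prefix += (token,)
    return visited


def on_policy_distill(steps=200, lr=0.5, seed=0):
    rng = random.Random(seed)
    student = {}  # prefix -> next-token distribution (uniform until visited)
    for _ in range(steps):
        for prefix in sample_from_student(student, rng):
            s = student.get(prefix, UNIFORM)
            t = teacher_probs(prefix)  # the teacher supplies the target here
            # Move the student toward the teacher on this student-chosen prefix.
            student[prefix] = [(1 - lr) * si + lr * ti for si, ti in zip(s, t)]
    return student
```

Only prefixes the student actually generates ever receive an update, which is the property that separates this from matching the teacher on a fixed dataset.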

4 Comments

Ross Dawson

36,277 followers 1y

LLMs are optimized for next-turn responses. This results in poor Human-AI collaboration, as it doesn't help users achieve their goals or clarify intent. A new model, CollabLLM, is optimized for long-term collaboration. The paper "CollabLLM: From Passive Responders to Active Collaborators" by Stanford University and Microsoft researchers tests this approach to improving outcomes from LLM interaction. (link in comments)
💡 CollabLLM transforms AI from passive responders to active collaborators. Traditional LLMs focus on single-turn responses, often missing user intent and leading to inefficient conversations. CollabLLM introduces a "multiturn-aware reward" system and applies reinforcement fine-tuning on these rewards. This enables AI to engage in deeper, more interactive exchanges by actively uncovering user intent and guiding users toward their goals.
🔄 Multiturn-aware rewards optimize long-term collaboration. Unlike standard reinforcement learning that prioritizes immediate responses, CollabLLM uses forward sampling (simulating potential conversations) to estimate the long-term value of interactions. This approach improves interactivity by 46.3% and enhances task performance by 18.5%, making conversations more productive and user-centered.
📊 CollabLLM outperforms traditional models in complex tasks. In document editing, coding assistance, and math problem-solving, CollabLLM increases user satisfaction by 17.6% and reduces time spent by 10.4%. It ensures that AI-generated content aligns with user expectations through dynamic feedback loops.
🤝 Proactive intent discovery leads to better responses. Unlike standard LLMs that assume user needs, CollabLLM asks clarifying questions before responding, leading to more accurate and relevant answers. This results in higher-quality output and a smoother user experience.
🚀 CollabLLM generalizes well across different domains. Tested on the Abg-CoQA conversational QA benchmark, CollabLLM proactively asked clarifying questions 52.8% of the time, compared to just 15.4% for GPT-4o. This demonstrates its ability to handle ambiguous queries effectively, making it more adaptable to real-world scenarios.
🔬 Real-world studies confirm efficiency and engagement gains. A 201-person user study showed that CollabLLM-generated documents received higher quality ratings (8.50/10) and sustained higher engagement over multiple turns, unlike baseline models, which saw declining satisfaction in longer conversations.
It is time to move beyond the single-step LLM responses we have been used to, toward interactions that lead to where we want to go. This is a useful advance toward better human-AI collaboration. It's a critical topic, and I'll be sharing a lot more on how we can get there.
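The forward-sampling idea above amounts to a Monte Carlo estimate: score a candidate reply by the average outcome of simulated continuations instead of by the reply alone. The sketch below assumes a hypothetical user simulator and conversation-level scorer (`toy_simulator`, `solved`) and is not CollabLLM's implementation.

```python
import random
from statistics import mean


def multiturn_aware_reward(history, response, simulate_rest, score,
                           n_samples=8, seed=0):
    """Average conversation-level score over sampled futures of `response`."""
    rng = random.Random(seed)
    rollouts = [simulate_rest(history + [response], rng) for _ in range(n_samples)]
    return mean(score(conversation) for conversation in rollouts)


def toy_simulator(conversation, rng):
    """Hypothetical user simulator: clarifying questions always lead to success,
    while an immediate guess only matches the user's intent some of the time."""
    if conversation[-1].endswith("?"):
        return conversation + ["user: the formal one", "assistant: SOLVED"]
    outcome = "SOLVED" if rng.random() < 0.3 else "WRONG"
    return conversation + [f"assistant: {outcome}"]


def solved(conversation):
    return 1 if conversation[-1].endswith("SOLVED") else 0
```

A reply such as "Which tone do you want?" receives the full reward in this toy setup, while an immediate guess is scored by how often the simulated conversation happens to end well.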

14 Comments

Sahar Mor

I help researchers and builders make sense of AI | ex-Stripe | aitidbits.ai | Angel Investor

42,134 followers 1y

Researchers have unveiled a self-harmonized Chain-of-Thought (CoT) prompting method that significantly improves LLMs’ reasoning capabilities. This method is called ECHO. ECHO introduces an adaptive and iterative refinement process that dynamically enhances reasoning chains. It starts by clustering questions based on semantic similarity, selecting a representative question from each group, and generating a reasoning chain using zero-shot CoT prompting. The real magic happens in the iterative process: one chain is regenerated at random while others are used as examples to guide the improvement. This cross-pollination of reasoning patterns helps fill gaps and eliminate errors over multiple iterations. Compared to existing baselines like Auto-CoT, this new approach yields a +2.8% performance boost in arithmetic, commonsense, and symbolic reasoning tasks. It refines reasoning by harmonizing diverse demonstrations into consistent, accurate patterns and continuously fine-tunes them to improve coherence and effectiveness. For AI engineers working at an enterprise, implementing ECHO can enhance the performance of your LLM-powered applications. Start by training your model to identify clusters of similar questions or tasks in your specific domain. Then, implement zero-shot CoT prompting for each representative task, and leverage ECHO’s iterative refinement technique to continually improve accuracy and reduce errors. This innovation paves the way for more reliable and efficient LLM reasoning frameworks, reducing the need for manual intervention. Could this be the future of automatic reasoning in AI systems? Paper https://lnkd.in/gAKJ9at4 — Join thousands of world-class researchers and engineers from Google, Stanford, OpenAI, and Meta staying ahead on AI http://aitidbits.ai
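The refinement loop described above can be sketched as follows, assuming the clustering step has already produced one representative question per cluster. `generate` is a hypothetical callable wrapping an LLM prompt (question plus demonstrations in, reasoning chain out), and `echo_refine` is a made-up name. The sketch follows the description in this post and not necessarily every detail of the paper.

```python
import random
from typing import Callable, Dict, List, Tuple

Demo = Tuple[str, str]  # (question, reasoning chain)


def echo_refine(
    questions: List[str],
    generate: Callable[[str, List[Demo]], str],
    iterations: int = 3,
    seed: int = 0,
) -> Dict[str, str]:
    rng = random.Random(seed)
    # Start with zero-shot CoT: no demonstrations for the first pass.
    chains = {q: generate(q, []) for q in questions}
    for _ in range(iterations):
        # Pick one chain at random and regenerate it,
        # using all the other chains as in-context examples.
        target = rng.choice(questions)
        demos = [(q, c) for q, c in chains.items() if q != target]
        chains[target] = generate(target, demos)
    return chains
```

The returned mapping is the set of harmonized demonstrations, which would then be prepended to new questions at inference time.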

1 Comment

Alon Bochman

12,677 followers 2y

Want to boost LLM performance? Merge two LLMs together.
I used to be active in data science competitions on Kaggle. The way to win a Kaggle competition is generally to create the biggest ensemble of models you can. Each model excels in its own corner of the prediction space, and when you put them together, you generally get a performance boost. Kind of like asking the same question of a lot of smart people.
This same technique is coming to large language models. It is called merging. Merging is cost-effective (no GPU required) and produces winners. For example, the Marcoro14-7B-slerp model, created using the mergekit library (link below), became the best-performing model on the Open LLM Leaderboard as of Feb 1, 2024.
The most common model merging technique is called SLERP (Spherical Linear Interpolation). Here’s how it works:
1/ Normalization: The input vectors from the LLMs are normalized to unit length. This ensures they represent directions rather than magnitudes.
2/ Angle Calculation: The angle between these vectors is calculated using their dot product.
3/ Interpolation: Spherical Linear Interpolation (SLERP) is used to smoothly interpolate between the vectors. It maintains a constant rate of change and preserves the geometric properties of the spherical space in which the vectors reside.
4/ Weight Calculation: Scale factors based on the interpolation factor and the angle between the vectors are computed. These factors are used to weigh the original vectors.
5/ Vector Summation: The weighted vectors are then summed to obtain the interpolated vector.
Another technique, BRANCH-SOLVE-MERGE (BSM) from Meta, has shown significant improvements in evaluation correctness and consistency for each LLM, enhancing human-LLM agreement by up to 26% and reducing length and pairwise position biases by up to 50%. It also improved story coherence while raising constraint satisfaction by 12%.
Want to try it out? Start with MergeKit (https://buff.ly/4bg4wU1)
Here are a few more resources:
BSM paper: https://buff.ly/3vn0uck
LLM-Slerp-Merge: https://buff.ly/4a6bREH
HuggingFace article on LLM merging: https://buff.ly/43s3hO1
#ArtificialIntelligence #AIResearch #DeepLearning #NLP #LLM #ModelMerging
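The five SLERP steps listed in this post translate almost line for line into code. The sketch below works on plain Python lists standing in for flattened weight tensors. The `slerp` name and the fallback to ordinary linear interpolation for nearly parallel vectors are assumptions of this example, not a copy of MergeKit's implementation.

```python
import math


def slerp(v0, v1, t, eps=1e-8):
    """Spherical linear interpolation between two weight vectors, 0 <= t <= 1."""
    # 1/ Normalize copies of the inputs so they represent directions only.
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    u0 = [x / norm0 for x in v0]
    u1 = [x / norm1 for x in v1]

    # 2/ Angle between the directions, from their dot product.
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(u0, u1))))
    theta = math.acos(dot)
    sin_theta = math.sin(theta)

    # Nearly parallel vectors: SLERP is ill-conditioned, so use plain LERP.
    if abs(sin_theta) < eps:
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]

    # 3/ and 4/ Scale factors from the interpolation factor and the angle.
    w0 = math.sin((1 - t) * theta) / sin_theta
    w1 = math.sin(t * theta) / sin_theta

    # 5/ Weighted sum of the original vectors.
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]
```

For example, `slerp([1.0, 0.0], [0.0, 1.0], 0.5)` lands halfway along the arc, at roughly `[0.7071, 0.7071]`, whereas plain averaging would give `[0.5, 0.5]` and shrink the vector's length. In a model merge this would be applied tensor by tensor, possibly with a different `t` per layer.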

7 Comments

Smriti Mishra

Data & AI | LinkedIn Top Voice Tech & Innovation | Mentor @ Google for Startups | 30 Under 30 STEM

89,258 followers 1y

What if your smartest AI model could explain the right move, but still made the wrong one?
A recent paper from Google DeepMind makes a compelling case: if we want LLMs to act as intelligent agents (not just explainers), we need to fundamentally rethink how we train them for decision-making.
➡ The challenge: LLMs underperform in interactive settings like games or real-world tasks that require exploration. The paper identifies three key failure modes:
🔹 Greediness: Models exploit early rewards and stop exploring.
🔹 Frequency bias: They copy the most common actions, even if they are bad.
🔹 The knowing-doing gap: 87% of their rationales are correct, but only 21% of actions are optimal.
➡ The proposed solution: Reinforcement Learning Fine-Tuning (RLFT) using the model’s own Chain-of-Thought (CoT) rationales as a basis for reward signals. Instead of fine-tuning on static expert trajectories, the model learns from interacting with environments like bandits and Tic-tac-toe.
Key takeaways:
🔹 RLFT improves action diversity and reduces regret in bandit environments.
🔹 It significantly counters frequency bias and promotes more balanced exploration.
🔹 In Tic-tac-toe, RLFT boosts win rates from 15% to 75% against a random agent and holds its own against an MCTS baseline.
Link to the paper: https://lnkd.in/daK77kZ8
If you are working on LLM agents or autonomous decision-making systems, this is essential reading.
#artificialintelligence #machinelearning #llms #reinforcementlearning #technology
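The "greediness" failure mode and the notion of regret mentioned above can be shown with a tiny bandit. This is a deliberately simplified sketch: rewards are noise-free so the outcome is deterministic, the arm means are invented, and `run_bandit` is a hypothetical helper. It does not reproduce the paper's RLFT method or its environments.

```python
def run_bandit(means, steps, warmup_pulls):
    """Play a bandit greedily after `warmup_pulls` forced pulls of every arm.

    Returns cumulative regret: the reward lost versus always playing the best arm.
    """
    n = len(means)
    counts = [0] * n
    totals = [0.0] * n
    best = max(means)
    regret = 0.0
    for t in range(steps):
        if t < warmup_pulls * n:
            arm = t % n  # forced exploration: cycle through all arms
        else:
            estimates = [totals[a] / counts[a] if counts[a] else 0.0
                         for a in range(n)]
            arm = max(range(n), key=lambda a: estimates[a])  # pure exploitation
        reward = means[arm]  # noise-free reward keeps the example deterministic
        counts[arm] += 1
        totals[arm] += reward
        regret += best - reward
    return regret


ARM_MEANS = [0.2, 0.5, 0.8]  # illustrative values, not taken from the paper
greedy_regret = run_bandit(ARM_MEANS, steps=100, warmup_pulls=0)
explore_first_regret = run_bandit(ARM_MEANS, steps=100, warmup_pulls=1)
```

The purely greedy agent locks onto the first arm that pays anything and accumulates a regret of about 60 over 100 steps, while a single forced pull of each arm brings that down to about 0.9.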

24 Comments
