Amazing work. Really nice that you included actual numbers (0.93 vs 0.78) in your semantic chunking example. When the synthesis recommends a method, does it pull benchmark comparisons across papers so I can see how methods perform on the same metrics? That's the thing that takes me forever to compile manually.
Yes - that's a pain, especially because different papers often use different test sets, settings, or metrics. What we do is find benchmark results across papers, aggregate them, and highlight which numbers are directly comparable and which are not.
We also distill all of it into a short set of bullet points, so the output is easy for you to review and act on.
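To make the comparability check concrete, here's a minimal sketch of that aggregation step. It assumes a hypothetical `Result` record per extracted benchmark row; the field names, the placeholder paper labels, and the settings-match rule are illustrative assumptions, not our actual pipeline:

```python
# Sketch: group benchmark scores by (dataset, metric) and flag which rows
# are directly comparable. Paper labels and fields are placeholders.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Result:
    paper: str      # source paper (placeholder label)
    method: str     # method being evaluated
    dataset: str    # benchmark / test set
    metric: str     # e.g. "nDCG@10"
    score: float
    settings: str   # e.g. "zero-shot", "fine-tuned"

def aggregate(results: list[Result]) -> None:
    """Group scores by (dataset, metric); treat rows as comparable
    only when their experimental settings also match."""
    groups: dict[tuple[str, str], list[Result]] = defaultdict(list)
    for r in results:
        groups[(r.dataset, r.metric)].append(r)

    for (dataset, metric), rows in groups.items():
        comparable = len({r.settings for r in rows}) == 1
        tag = "comparable" if comparable else "NOT directly comparable (settings differ)"
        print(f"{dataset} / {metric}: {tag}")
        for r in sorted(rows, key=lambda r: r.score, reverse=True):
            print(f"  {r.score:.2f}  {r.method}  ({r.paper}, {r.settings})")

# Example reusing the numbers from the chunking comparison above:
aggregate([
    Result("paper A", "semantic chunking", "BEIR", "nDCG@10", 0.93, "zero-shot"),
    Result("paper B", "fixed-size chunking", "BEIR", "nDCG@10", 0.78, "zero-shot"),
    Result("paper B", "fixed-size chunking", "BEIR", "nDCG@10", 0.81, "fine-tuned"),
])
```

The key design point is that matching dataset and metric alone isn't enough: the settings field is what separates an apples-to-apples row from one that only looks comparable.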