Using Machine Learning to Understand Page Templates

philipodonnell · on July 6, 2018

When I saw this title I assumed it was about using ML to build dynamic scrapers that self-configure to find where the changing data elements are. Does that exist anywhere?

ksahin · on July 7, 2018

I think this is what Diffbot does!

dwynings · on July 8, 2018

At a very high level, it's similar. We use computer vision and ML to extract structured data from any web page, even ones we haven't seen before. https://www.diffbot.com/

If anyone has any questions or wants to try it out, feel free to email me directly at dru@diffbot.com

autokad · on July 6, 2018

interesting read. I liked the part about identifying rare templates. plus, if I ever plan to do ML on html the article gave me a good place to start on feature extraction.

might have some uses in identifying phishing urls, etc
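For anyone wondering what feature extraction on HTML could look like, here is a minimal sketch. It is not the article's method, just one common idea: count root-to-element tag paths and compare pages by how many paths they share. The names `TagPathCounter`, `tag_path_features` and `jaccard` are made up for illustration, and it only uses the Python standard library.

```python
from collections import Counter
from html.parser import HTMLParser

# Void elements never get a closing tag, so they are counted but not pushed.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagPathCounter(HTMLParser):
    """Counts root-to-element tag paths such as 'html/body/div'."""

    def __init__(self):
        super().__init__()
        self.stack = []         # currently open tags, outermost first
        self.paths = Counter()  # tag path -> number of occurrences

    def handle_starttag(self, tag, attrs):
        self.paths["/".join(self.stack + [tag])] += 1
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_path_features(html):
    """Return a Counter of tag paths for one HTML document."""
    parser = TagPathCounter()
    parser.feed(html)
    parser.close()
    return parser.paths


def jaccard(a, b):
    """Jaccard similarity between the sets of tag paths of two pages."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    return len(keys_a & keys_b) / len(union) if union else 1.0
```

Pages built from the same template should score close to 1.0 even when their text differs, and the path counts could be fed into whatever clustering algorithm you prefer.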

abadon · on July 6, 2018

The title should be "Clustering HTML pages". It's not a terribly interesting application. The only thing new I got from it was a de-noising technique.

inputcoffee · on July 6, 2018

Not sure why you're getting down-voted. You're right, the author didn't close the loop. Typically we would expect to see an insight after you apply the ML technique.

So they extracted features, clustered the pages and found... what?

I am sure they learned something but it might be proprietary.

pagnol · on July 6, 2018

What I'd really like to see is a presentation of an algorithm that automatically recognizes and hides the first dismissive HN comment that inevitably appears. Any takers?

goostavos · on July 6, 2018

Recognize, hide, and automatically post to /iamverysmart.

abadon · on July 6, 2018

You can do it yourself, if you're so inclined. Out-of-the-box sentiment analysis is ~90% accurate. Feel free to provide training data for it.

rimliu · on July 6, 2018

Why would you want to have your comment hidden?

On a more serious note, I would not mind a bit more critical and less clickbaity attitude in tech. Not every if statement or regexp is ML and AI, and not everything requires blockchain.

jacquesm · on July 6, 2018

Recursively?

edhu2017 · on July 6, 2018

most of the images won't load for me.

coding123 · on July 6, 2018

and the title of the browser tab is flashing at me...