Regex.ai is an AI-powered tool that generates regular expressions. It can accurately generate regular expressions that match specific patterns in text with precision. Whether you're a novice or an expert, Regex.ai's intuitive interface makes it easy to input sample text and generate complex regular expressions quickly and efficiently. Overall, Regex.ai is a game-changer that will save you time and streamline your workflow.
I often find it faster to write something from scratch rather than to work with someone else’s code to fix it. In the latter case I need to understand the intent, the whys behind the choices.
Well guess what, LLM-generated code is someone else’s code: an amalgamation derived from many people’s code. Except those people are ‘helpfully’ “abstracted away” from you by the middleman, so you can’t know their original intents and choices. What’s worse, it’s someone else’s code that will be treated as your code—unlike working with a legacy system that everyone knows was written by some guy, in this case any bugs will be squarely on you.
This offering, and the other half-dozen like it this past week or so, is like giving a kid a flamethrower.
It's all fun and games until they burn down your house.
> ... I need to understand the intent, the whys behind the choices.
As do I.
And that is something ChatGPT-X (for any given X) cannot provide, regardless of whether or not what is produced is correct. Perhaps with some form of backward chaining[0], a ChatGPT-X could someday explain how it arrived at what it produced.
It's weird to see a forum for hackers, with hacker in the name, and with a line about encouraging curiosity in the charter, be so hostile to someone who hacked something together.
Sign of the times perhaps.
Though I guess it's not much different from the thread trashing Dropbox however many years back.
>so hostile to someone who hacked something together.
It's not hostility, but I'm a bit tired of all those projects sprouting up around AI.
If it was an open-source project full of bugs, I would understand, and encourage and give solutions to the creator of the project, maybe even create tickets or fix bugs.
But with AI, we are flooded with tons of closed-source frontends to a closed-source backend, and those projects are more than buggy since they confidently give bad solutions. It's not like a "DIY electric car project," it's someone putting pieces of cardboard on a Tesla and pretending it makes it safer or faster.
I'm dumbfounded and I don't know how I'm supposed to react to this. I would certainly not release that to anyone, since it's antithetical to what I do and to what I believe software should be.
Good point. I wish OpenAI released more of their work as open source. I wish people building on top of them did too. That said, I usually won't begrudge a small-time developer or entrepreneur from choosing whatever licensing model they think is going to make them the most money. An army of small-time entrepreneurs who build closed source can still have democratizing effects on a market that's been captured by a few large companies. I'm more frustrated when I see big, entrenched companies finding ways to capture value from the open source ecosystem and privatize it.
My view on v1s, prototypes and PoCs, regardless of their licensing, is that by design they're going to be a mess and have errors; if they don't, you waited too long to ship. Maybe these folks should have been a little more honest in their marketing, but man, if we're going to get into a list of the offenders on that front, I think they are way, way down on that list.
Overall, in my view, LLMs are the most disruptive thing to come along since the Web itself. Business models like Google's are facing a direct challenge from this technology. Why do I want to look at Google's first page full of shitty search ads when I can use an LLM to get an answer immediately? As far as I'm concerned, at this stage I would love to see a billion projects from every corner of the world built on top of this technology. Whether they're great or they're crap, the avalanche is the first real opportunity in many years to disrupt some giants.
> It's weird to see a forum for hackers, with hacker in the name, and with a line about encouraging curiosity in the charter, be so hostile to someone who hacked something together.
My comment was in direct response to an overarching concern raised by the implications of incorporating "LLM-generated code." This is relevant here due to the "Show HN" description above, which reads thusly:
Regex.ai is an AI-powered tool that generates regular
expressions. It can accurately generate regular expressions
that match specific patterns in text with precision.
If you interpreted my characterization of "... like giving a kid a flamethrower" as being hostile, then I extend my apologies to the OP as I was using this phrase as a literary tool detailed subsequently. I thought the subject expansion of "the other half-dozen like it this past week or so" was sufficient.
As to "encouraging curiosity", I point you to feedback I provided to the OP in a reply peer to this one.
Are you trying to say that every sort of criticism equals hostility? If I don't like your half-thought-out idea, I am hostile. If I praise it, I feel like an idiot. Not much choice remaining, after all...
I’m not critical of the hack itself (unless it uses OAI’s closed commercial LLMs). Just not a fan of some implications of using it in real circumstances: it might work for a personal thing but if you use it for anything important you still need to know how regular expressions work.
> It's weird to see a forum for hackers, with hacker in the name, and with a line about encouraging curiosity in the charter, be so hostile to someone who hacked something together.
I guess people are getting tired of too many topics in one narrow space. I come to HN for variety. It does get tiring when every single day I see yet another LLM-based solution attempting to solve a problem I don't think I even have.
Overdose of a certain topic is not good for a general tech forum like this. Everything should be in moderation and all that.
This forum is also against decentralization and Web3, and often shills for large centralized corporations. The ethos of hackers was always ANTI that stuff.
You can ask it to explain why. It might not be a true representation of why those decisions were made but at least it’s a plausible explanation of why something could work like that which is better than nothing. I’m not sure why you think it can’t do that already?
So if I look at most codebases, someone would be able to explain what all the code does and why it does that way? I'm extremely sceptical of that, even if I myself wrote the code 3 weeks ago.
A person should be able to explain the code they're adding to a repo at the time they are adding it. Whether or not they can explain it at some arbitrary point in the future is a different question/issue.
It's even worse. When working with someone else's code, e.g. from Stack Overflow, there's a reputation system gating access to the platform and incentivizing people to provide correct answers. You can reasonably expect that someone else's code has at least been thought through to some extent to solve the problem at hand, and very likely tested.
With LLM-generated code, especially ChatGPT-style decoder models, none of that is true. All of the posts and comments I see about it here seem to be anecdotes "it can do all of my job for me" yet asking it to write the simplest code creates several issues on my end.
Personally I think a model geared towards code generation isn't an unsolvable task; the Spider dataset was released some time ago (text to SQL task) and the winning approach there was no fanciness on the model side, but rather to just test all the output queries to ensure it's at least valid SQL. That got a 20%+ boost in accuracy.
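The Spider-style validity test described above can be sketched in a few lines: ask a SQL engine to plan, not execute, each candidate query, and discard any it rejects. The toy schema and candidate strings below are illustrative assumptions, not from the Spider dataset:

```python
import sqlite3

def is_valid_sql(query: str) -> bool:
    """Ask SQLite to plan (not run) the query; syntax errors surface here."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")  # toy schema, an assumption
    try:
        conn.execute("EXPLAIN QUERY PLAN " + query)
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

candidates = ["SELECT a FROM t WHERE b = 'x'", "SELEC a FROM t"]
print([q for q in candidates if is_valid_sql(q)])  # the misspelled query is dropped
```

The same pruning idea applies to regex generation: a candidate that doesn't even compile can be filtered out before the user ever sees it.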
Your experience is no less anecdotal than the millions of people who successfully use Copilot and ChatGPT to write code on a daily basis. I am one of those and can't imagine coding without Copilot or an equivalent ever again.
> the millions of people who successfully use Copilot and ChatGPT to write code on a daily basis
Where did you get that number from? Are you saying that roughly one in every thousand people on Earth, alive today, is using Copilot and ChatGPT to write code on a daily basis?
Not the parent, but it's not completely impossible. According to [0], there are about 25-30 million software developers in the world. If about 7-8% of them use ChatGPT and Copilot every day, that's already two million.
I guess it's early for this to matter too much for the count, but people who are not "developers" have also used ChatGPT to write code. I've read anecdotes.
Software exists over time. There is no “successful” unless you account for future bugs.
I do believe LLM code generators can be used with good results. I just know that for me that way is slower and more painful, because I need to switch between creative mode (when I make stuff) and debugging mode (when I need to figure out how someone else’s stuff works). I find keyboard typing speed is usually not what slows me down the most…
I'm (genuinely) curious what kind of code you write. I haven't tried Copilot and I haven't used ChatGPT very much, but I feel I would be pretty surprised if either of them made significant improvements to my workflow.
Copilot I could see, since I already use Intellisense, autocomplete, and snippets to great effect. I'd be annoyed if I had to work without them. But in general, knowing what I want the code to do is >90% of the work of writing new code.
I feel there are a few possibilities for why I'm confused:
1. I'm not a very good software engineer, at least in certain respects. Maybe I should have a better understanding of architecture patterns or something I might have learned in a CS degree. Maybe I am hacking everything together and maybe I am already a slow coder.
2. I'm not [being] creative enough as a prompt engineer. I typically can't think of any way that ChatGPT could help me without ingesting my entire repo and figuring out the correct patterns. It could be, however, that there are ways to get the answers I need with better questions.
3. We do completely different kinds of work, and some kinds of coding are better suited for AI assistance than others.
The opposite of 1 is also possible. You're a really good programmer and know the material better, and just don't need to ask the kinds of questions that other people are asking ChatGPT (or stack overflow, or man pages) for/are happy with your current reference materials.
Define successfully. You might verify what the LLM gives you, but lots of people who blindly copy and paste from Stack Exchange will do the same with ChatGPT.
Like autopilot in planes that fall back to experienced pilots, we're embarking on the most dangerous "uncanny valley" maneuver where these systems will be adopted by experienced pilots who know the limits but who will inevitably be followed by either no one or students whose conception is entirely synthetic.
At that point the plane AI better be 100% TRUSTWORTHY cause there's no safe fallback.
If you have a choice between descriptions and performance, I humbly suggest detailed descriptions, perhaps with links to tutorials and/or further reading. Who cares if the wrong thing is returned quickly, when it lacks any context?
Also, consider how to express anchoring and/or grouping preferences in the UI, or weighting based on highlight positioning. These are oft-used features of regex languages.
If you don’t understand regexes well enough to write them yourself, you should not get some AI to generate them for you. You won’t be able to verify whether they do what you want, and the bugs can be subtle and destructive.
I read a few weeks ago here on HN about one large SaaS grinding to a halt because of a greedy selector in one line of regex. Not sure how people find old stories; it's lost to me now. But it was an excellent example of why regex is dangerous and requires a lot of care to write. I wouldn't trust an AI to write my regex unless I saw that people were finding it to be consistently better than they are at writing what they need.
You gave it an example where inferring the semantics you were after was basically a crapshoot. It’s not going to do well under those conditions. Nor will a random human who lacks insight into what specifically you are after. Did you want all the bazzes that are at the end of lines? The bazzes that follow bars? Who knows?
Try giving it examples where the data provides context cues.
Well, I tried extracting fields from some logs I had lying around and I can tell you why I think it's not useful:
1. I select DEBUG and INFO, but it doesn't figure out that there are WARNs etc. in there and extract those too.
2. Some of the regexes are just... wrong? I selected individual fields, but there's one mangled regex that gives me two fields and the text in between; I didn't ask for that and it's no use.
3. None of the regexes could extract the date I selected (of the form 2023-03-28 05:23:28.844); some of the 'agents' used the literal date, and the only one that broke it down into \d's didn't match anything because the DEBUG and INFO were mangled into it.
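For what it's worth, a timestamp in that format is matched by a short hand-written pattern that spells out the digit groups rather than hard-coding a literal date (the log line below is made up for illustration):

```python
import re

# Timestamps like 2023-03-28 05:23:28.844: four-digit year, two-digit
# month/day/time fields, and a three-digit millisecond suffix.
ts = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}")

line = "2023-03-28 05:23:28.844 INFO starting up"  # made-up log line
print(ts.search(line).group())  # 2023-03-28 05:23:28.844
```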
I'm not really sure how this would be at all useful in its current form?
Well, I know nothing about AI and tried with simple variations of "foo bar baz."
The only solutions that worked were either "\w+ \w+ \w+..." which does not filter anything and may produce errors with other content, or "(first line|second line|third line)" which could be replaced by a bunch of if statements.
The other solutions were plainly wrong but at least they are honest about it and it's shown in the user interface.
How do I tell it to generate a regex for emails?
Try selecting the emails: all four generated regexes are wrong.
Even if one of them was right, how do I choose between 4 choices if I don't know the meaning? I have to verify the generated regexes, and verifying complicated regexes is much harder than writing them in the first place.
How about instead of an AI generating a regex we can't understand, we put energy into actually well-developed methods for parsing & validating text? Why put code you can't understand in your database?
Reality check: there are people, like my colleague, who aren't software engineers and still have to occasionally maintain/create a regex in some corporate software config.
That's even worse. They might not have the knowledge to realize the regex an AI gives them is bunk, or to debug it when it fails.
I'd like to see some numbers on a tool like this. If a huge majority of people are seeing genuine improvements in their workflow with it, I won't be a luddite yelling at them. Rare, low-severity failures shouldn't hold us back.
But the potential cost of failure with (any) regex is very high, so I personally wouldn't want to trust anything remotely mission-critical to a person who doesn't understand regex well enough to write it themselves, and if they can write it on their own, that's often faster than debugging AI-generated regex.
If you would like to generate a regular expression by giving an example input text and an example output match, you could use this closed-form solution tool:
https://regex-generator.olafneumann.org/
Usually when you have an AI like this that is supposed to generate verifiable results, you do an adversarial test where you ask it to solve problems that you already know the answer to, to make sure it works.
It looks like no one did that here. Even using the sample data provided, if you highlight a few of the addresses, it can't find the rest of them, mainly because it generates a regex with ST/AVE/LN in it, missing all the ones with RD. And if you add an RD sample, it just adds that to the list.
There's lots of great innovation coming with LLMs, but people are forgetting their "AI basics" when it comes to verifying them.
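A minimal sketch of that kind of known-answer check, using a made-up address sample in the spirit of the one described (the patterns here are illustrative, not the site's actual output):

```python
import re

def passes_known_answers(pattern, text, expected):
    """Adversarial check: run a candidate regex over text whose correct
    matches are already known, and compare against them."""
    try:
        return re.findall(pattern, text) == expected
    except re.error:
        return False

# Made-up sample in the spirit of the site's address example.
text = "12 MAIN ST, 9 OAK AVE, 4 HILL RD"
expected = ["12 MAIN ST", "9 OAK AVE", "4 HILL RD"]

# A candidate like the one described (no RD alternative) fails the check,
# while a more general pattern passes.
print(passes_known_answers(r"\d+ [A-Z]+ (?:ST|AVE|LN)", text, expected))  # False
print(passes_known_answers(r"\d+ [A-Z]+ [A-Z]+", text, expected))         # True
```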
I just used ChatGPT to create a ton of permutations for product pricing that I'm putting on Stripe as products.
Except... it made ONE ERROR that I just spent two hours tracking down and fixing in my JSON file and now in the Stripe dash. (I coincidentally found the error using ChatGPT lol).
It's probably still faster and less error-prone than I could have done it manually. But it's still error-prone...
The Reflexion paper (https://arxiv.org/abs/2303.11366) that came out recently shows how this kind of mistake might be overcome. Asking the model to think about the answer after it's generated a first draft greatly improves accuracy. Also, prompt engineering such as copying the generated code, pasting it in a new chat and saying "There's a bug in this code, please find it" can go a long way. There is so much low hanging fruit in harnessing the power of these models that is just being ignored because some even lower hanging fruit (RLHF, system messages, context window size, plugins, etc) is being released seemingly every few days.
If you ask the model to "think" about something, and then it simulates that action and outputs what the result of that might be, does it matter if it's really thinking or not? Especially if the output is what we wanted originally?
I would suggest that a person saying "ask the model to think about" in this context in no way implies that that person is confused about the nature of the model, it is simply a convenient piece of language that helps us to achieve the desired result.
He did not say "make the model think about..." or imply that the model is thinking. He simply, and _correctly_, pointed out that if you _ask_ the model to think, it improves the answer.
It looks like you just pattern-matched on the word _think_ and replied with a pre-made opinion about how AIs can't think. Ironic...
I was curious if this would be smart enough to generate a regex for any four letter word so I copied the tagline of the site and highlighted all four letter words in it. (I have deleted the previous highlights of course.) It generated three regexes that just had a union of those words and one which started off good-ish by looking for any word of length of three or four, but then tacked on some random suffix and in the end this most promising regex turned out to not even match anything in the source text. As a suggestion to the authors of this tool I'd propose to add a step where any generated regexes that don't match anything in the input text are removed from the results.
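That suggested filtering step is only a few lines; the sample text and candidate patterns below are made up for illustration:

```python
import re

def prune_candidates(candidates, sample):
    """Drop generated regexes that don't compile or don't match anything
    in the sample text the user highlighted from."""
    kept = []
    for pat in candidates:
        try:
            if re.search(pat, sample):
                kept.append(pat)
        except re.error:
            pass  # not even a valid regex
    return kept

sample = "Regex.ai is an AI-powered tool that generates regular expressions."
candidates = [r"\b\w{4}\b", r"\b[a-z]{3,4}xyz\b", r"(unbalanced"]
print(prune_candidates(candidates, sample))  # keeps only r"\b\w{4}\b"
```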
The results I got from this were unfortunately not useful. For example, in trying to extract the property names from a connection string, I highlighted all of the property names along with the equals sign. For comparison, asking ChatGPT:
> Write a regular expression to extract the property names from this PostgreSQL connection string: "PostgresSql": "Host=localhost;User ID=postgres;Password=xxxx;Database=test;Application Name=Test1234,Port=35432;Pooling=false;"
Yields the response with an explanation:
(?<=[^\\w])([A-Za-z ]+)(?==)
"This regular expression will match any sequence of alphabetic characters (upper or lower case) that are followed by an equal sign (=). The negative lookbehind (?<=[^\\w]) ensures that the property name is not preceded by another word character."
A quick test on regex101.com shows this works perfectly.
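The same check can be reproduced in plain Python; assuming the doubled backslash in the quoted response is just escaping, a single `\w` is used here:

```python
import re

# The connection string from the comment above, quoted as in the prompt.
conn_str = ('"PostgresSql": "Host=localhost;User ID=postgres;Password=xxxx;'
            'Database=test;Application Name=Test1234,Port=35432;Pooling=false;"')

# Lookbehind requires a non-word character before the name; lookahead
# requires a literal '=' after it.
names = re.findall(r"(?<=[^\w])([A-Za-z ]+)(?==)", conn_str)
print(names)
# ['Host', 'User ID', 'Password', 'Database', 'Application Name', 'Port', 'Pooling']
```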
Sorry, don't like to be overly critical. Someone has attempted to solve a common problem for developers, but LLMs are going to blow applications like this away. And I think that ChatGPT, at version 4, has become a truly useful tool.
Cool hack! I'm having some trouble thinking of a case where I wouldn't just explain to Copilot / ChatGPT what I need. Maybe specifically in cases where I had the raw data but not the column titles?
Going a little more "end user needs vs. new tech offers": the intent is in the right place, but the output isn't helping as much as normal programmatic tools.
At least for me, what would make this a killer app would be the ability to read a document or PDF or big text dump and:
1: identify "possible fields" (first name, date of birth), "probable fields" (middle name or other fields that are part of the data set but don't appear in every line) and "probable junk data" (page numbers, page headers, useless PDF padding)
2: allow selection or tuning of these fields to generate regex to catch or remove only the data related to the parsed fields.
I THINK there's something done with pandas (pandoc?) that can help tear a document apart and get fields or basic doc structure, but AI would need to take it from there and present it in a clear, concise and optionally explained way, so a busy office worker could just copy the regex filter into a spreadsheet formula or program function.
This doesn't seem to generate great regex, but it does seem to generally work(ish?), so I guess nobody would care. That said, how does this work? Are you just sending this off to one of the AI APIs? What's going on with the data pasted in the box after we hit run?
Struck me as funny that we have another thread going about people pasting company data into ChatGPT, and here we have a regex AI with an example that looks like it's encouraging you to trust it with helping you regex through your PII: just paste it in the box and highlight what you need, lol (not saying that's the intent, just that's what less savvy users may do).
Light on details, heavy on philosophers, trend setters, idea banks, and radicals that make me worried I'm dealing with opportunists taking swings at monetizing a bunch of .ai domains. Especially the weird cinematic banner.
It's nice and there are use cases for it, but if I ever need something like it, I'll prolly just explain what I want to ChatGPT and tell it what regexp engine I'm using, and it'll give me results I'll paste to regexr.com for tests. The only added value here is that I wouldn't need to think of a prompt, but I've become good at finding nice prompts for programming problems, so directly querying ChatGPT is what I'd go for, personally.
Also, I'm not sure what underlying tech is used, and the only explanation of the tool seems to be a YouTube video, so I didn't look further. I'd like to know more about how it's made, if that's possible and something the author would be OK to share.
This is a really nice implementation, so full credit to the creator. Regex is always confusing to compose, but it's also one of those situations where I can't help but wonder if the solution is just to improve upon / provide a nice abstraction for regex rather than handing over full control to a non-deterministic AI.
I've seen at least 2 projects in the last 6 months using LLMs to generate bash code, which seems like a similar solve. LLMs are super cool, but there's a massive advantage to actually understanding what your code does, and LLM-generated regex, bash, assembly etc. loses that.
Dang, there goes my investment in Jeffrey Friedl's "Mastering Regular Expressions" which launched my programming career after I discovered a reference to it in "Dreamweaver Bible" back in 2000.
To be honest, I find ChatGPT sufficient for regex. I usually ask it for test cases that I can then validate in a regex playground to make sure the regex is working as expected.
Even with the examples on the landing page, the regexes generated for the emails are not really usable. It needs way more examples to produce the right thing.
Even though I doubt most production code uses the actual, correct, RFC-compliant regex to match emails (it's a monster), this does nothing to improve the situation...
It really needs some zero-shot or few-shot magic from LLMs, or even heuristics to detect common patterns like emails, and just generate a sane regex, rather than stuff like [A-Za-z]{2,}@libertylabs\.ai, which will obviously fail with a few more examples.
It doesn’t make any sense to use a regex to check emails beyond catching some very basic typos. Why do you need an email? To send messages to the user. Then do that: send a validation message and see if someone gets it.
The input isn’t rich enough to generate generalizable regexes (regecies?). Would probably be better off with a text input explaining what the user wants to match. Text-to-regex includes intent.
That would make it look broken when it's part of what it is. Why should it pretend to be something else? It's not an algorithm that produces a specific result.
Edit: I got a message saying there were too many requests. So much for not appearing broken. And I'm not using a VPN or anything so I'd appear as ordinary traffic.
Honestly, writing a regex is way easier than reading a regex, no? So it feels like now I have the harder task of proving that the generated regex is correct.
Or you can use "verbose regex" which some languages implement like in Python (https://docs.python.org/3/library/re.html#re.X). The spaces are ignored and you can add comments on each line. I used this in the past and my coworkers were happy about it because they could understand the regex and even modify it.
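As a sketch of what that looks like in Python, with an illustrative (not RFC-complete) email pattern:

```python
import re

# Verbose mode (re.X): whitespace inside the pattern is ignored, so each
# piece can sit on its own line with a comment.
pattern = re.compile(r"""
    [\w.+-]+       # local part
    @              # separator
    [\w-]+         # domain name
    (?:\.[\w-]+)+  # one or more dot-separated labels
""", re.X)

print(bool(pattern.fullmatch("user@libertylabs.ai")))  # True
print(bool(pattern.fullmatch("not an email")))         # False
```

The compiled pattern behaves exactly like the one-line equivalent; only the source layout changes.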
I consider myself a product designer so this is absolutely not true for me. Every time I try to write a Regex I have no idea how to even start. Copilot has been really good at starting me off and then I’ll take it to a regex site and understand it
That's why LLMs aren't much help to me -- they just increase my workload by giving me more code to read and review. If I write it myself, I already know what it means, so that saves time and effort.
I find this mostly pays off in debugging: Having written code usually means I know it better than code I've reviewed, which I know better than code I've never seen. Finding a weird bug in code I know well is a _lot_ easier.
For me writing a regex is easy only if I remember the syntax, which I never do because they differ between languages and I only need them once a month or so.
For me, the fastest way is to ask a generator to create a valid, if not necessarily correct, regex so that I can tweak it. I successfully used GPT for just that recently. It even got the capture groups right.
I agree. I see similar arguments ("just write examples") a lot, and I really don't get that. Finding a comprehensive set of examples for code, regexps, shell, whatever is very, very hard.
This feels like a really cool idea for a tool. I would 100% use something that generates strings matching a regex, for checking my own regexes or understanding other people's.
Or, just write regular expressions?
> ... Regex.ai's intuitive interface makes it easy to input sample text and generate complex regular expressions quickly and efficiently.
See: https://www.ibm.com/topics/overfitting
Inputting the sample text:
And highlighting the first "baz" produced patterns which all had "[A-Z][a-z]*@libertylabs\\.ai" included, presumably due to the default inclusions. Removing those and highlighting the second "baz" resulted in "<Agent B>" as the result in one case.
There is no explanation of any patterns generated. If a person is to use one of the generated patterns and Regex.ai is supposed to "save you time and streamline your workflow", no matter "[w]hether you're a novice or an expert", then some form of verification and/or explanation must exist.
Otherwise, a person must know how to formulate regular expressions in order to determine which, if any, of the presented options are applicable. And if a person knows how to formulate regular expressions, then why would they use Regex.ai?