Web developers often need to extract plain text from HTML. A related case is text that users paste from MS Word, which carries its own markup.
In this article I will post a few simple functions that help me manipulate text containing HTML and MS Word tags in various ways.
All functions are really small, and most of them use regular expressions.
Clear HTML Tags
This function replaces all tags with the ReplaceChar string. By default a space is used, because extra spaces cause no problems when the cleaned text is shown on a web page.

    Imports System.Text.RegularExpressions

    Function StripHTMLTags(ByVal HTMLToStrip As String, Optional ByVal ReplaceChar As String = " ") As String
        If (HTMLToStrip <> String.Empty) Then
            ' Replace every tag (anything between "<" and the next ">") with ReplaceChar.
            Return Regex.Replace(HTMLToStrip, "<.*?>", ReplaceChar)
        End If
        Return String.Empty
    End Function
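In .NET the `.` in `<.*?>` does not match a line break by default, so a tag that is split across two lines is left in place. The sketch below is one possible variant, not part of the original function: `StripHTMLTagsCompact` is a hypothetical name, it uses `<[^>]*>` (which does span line breaks), and it also collapses the whitespace that the replacements leave behind.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module StripHtmlDemo
    ' Hypothetical variant of StripHTMLTags: "[^>]*" also matches line breaks,
    ' so a tag split across lines is removed as well.
    Function StripHTMLTagsCompact(ByVal html As String) As String
        If String.IsNullOrEmpty(html) Then Return String.Empty
        ' Replace every tag with a single space.
        Dim text As String = Regex.Replace(html, "<[^>]*>", " ")
        ' Collapse the runs of whitespace left behind and trim both ends.
        Return Regex.Replace(text, "\s+", " ").Trim()
    End Function

    Sub Main()
        ' The second <p> tag contains a line break between "<p" and its attribute.
        Dim html As String = "<p>Hello <b>world</b></p><p" & vbLf & " class=""x"">again</p>"
        Console.WriteLine(StripHTMLTagsCompact(html))   ' Hello world again
    End Sub
End Module
```

Collapsing whitespace is optional: if the text is only ever rendered as HTML, the extra spaces are harmless, as noted above.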
Clear MS Word Tags
The above function cannot clear all the markup that MS Word produces. The function below does the trick. Combining both functions gives you plain text.

    Function StripWordTags(ByVal HTMLToStrip As String, Optional ByVal ReplaceChar As String = " ") As String
        If (HTMLToStrip = String.Empty) Then Return String.Empty
        ' Flatten line breaks first so that ".+?" can span multi-line comments.
        StripWordTags = Regex.Replace(HTMLToStrip, "\n", ReplaceChar)
        StripWordTags = Regex.Replace(StripWordTags, "\r", ReplaceChar)
        ' Remove HTML comments, which is where Word keeps its conditional blocks.
        StripWordTags = Regex.Replace(StripWordTags, "<!--.+?-->", ReplaceChar)
        ' The final pattern is truncated in this listing (only its tail, "]*>", survives).
        ' As written it would replace every ">" character, so it is left commented out.
        ' StripWordTags = Regex.Replace(StripWordTags, "]*>", ReplaceChar)
    End Function

    Function StripAllTags(ByVal HTMLToStrip As String, Optional ByVal ReplaceChar As String = " ") As String
        ' Word markup first, then whatever ordinary HTML tags remain.
        StripAllTags = StripWordTags(HTMLToStrip, ReplaceChar)
        StripAllTags = StripHTMLTags(StripAllTags, ReplaceChar)
    End Function
String To URL
This function checks whether a string looks like a URL and, if not, converts it to one by adding the http:// prefix and a trailing slash. This comes in handy for fixing user input when a value should be a website or domain.

    Function Str2Url(ByVal Str As String) As String
        Str2Url = Str
        If (Str2Url <> "") Then
            ' Add a scheme when the input has none.
            If (Not Str2Url.StartsWith("http://") AndAlso Not Str2Url.StartsWith("https://")) Then
                Str2Url = "http://" & Str2Url
            End If
            ' Make sure the URL ends with a slash.
            If (Not Str2Url.EndsWith("/")) Then
                Str2Url = Str2Url & "/"
            End If
        End If
    End Function
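A short usage sketch may help show what to expect. The function is repeated here, with the same logic, only so that the sample runs on its own; the host name is a placeholder. Note the last call: because a slash is always appended, the function is best suited to bare domains rather than paths that end in a file name.

```vbnet
Imports System

Module Str2UrlDemo
    ' Mirrors the article's Str2Url so that this sample is self-contained.
    Function Str2Url(ByVal Str As String) As String
        If String.IsNullOrEmpty(Str) Then Return String.Empty
        Dim url As String = Str
        ' Add a scheme when the input has none.
        If Not url.StartsWith("http://") AndAlso Not url.StartsWith("https://") Then
            url = "http://" & url
        End If
        ' Always end with a slash.
        If Not url.EndsWith("/") Then url &= "/"
        Return url
    End Function

    Sub Main()
        Console.WriteLine(Str2Url("example.com"))            ' http://example.com/
        Console.WriteLine(Str2Url("https://example.com/"))   ' https://example.com/
        Console.WriteLine(Str2Url("example.com/page.html"))  ' http://example.com/page.html/
    End Sub
End Module
```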
Websafe function
This function removes from a string all characters that could cause problems in a URL, e.g. when a file is uploaded and its original file name contains characters that would break a URL.
Also, some web servers cannot correctly handle file names with non-Latin characters. I have a function that converts all Greek file names to an equivalent name with Latin characters (and vice versa).

    ' Greek-to-Latin pairs in the form "Greek,Latin", separated by ":".
    Private ReadOnly GreekPairs() As String = "Α,A:Β,B:Γ,G:Δ,D:Ε,E:Ζ,Z:Η,I:Θ,TH:Ι,I:Κ,K:Λ,L:Μ,M:Ν,N:Ξ,KS:Ο,O:Π,P:Ρ,R:Σ,S:Τ,T:Υ,Y:Φ,F:Χ,X:Ψ,PS:Ω,W:Ά,A:Έ,E:Ή,I:Ί,I:Ό,O:Ύ,Y:Ώ,W:Ϊ,I:Ϋ,Y:ά,a:έ,e:ή,i:ί,i:ΰ,y:α,a:β,b:γ,g:δ,d:ε,e:ζ,z:η,i:θ,th:ι,i:κ,k:λ,l:μ,m:ν,n:ξ,ks:ο,o:π,p:ρ,r:ς,s:σ,s:τ,t:υ,y:φ,f:χ,x:ψ,ps:ω,w:ϊ,i:ϋ,y:ό,o:ύ,y:ώ,o:ΐ,i".Split(":"c)

    Function WebSafe(ByVal Key As String) As String
        WebSafe = Greek2GrEnglish(Key.Replace("^", "").Replace(" ", "").Replace("|", "")).ToLower()
        ' Keep only digits, Latin letters, dots and underscores.
        WebSafe = Regex.Replace(WebSafe, "[^0-9a-zA-Z._]", "")
        ' Collapse consecutive dots into a single dot.
        WebSafe = Regex.Replace(WebSafe, "\.+", ".")
    End Function

    Function Greek2GrEnglish(ByVal GreekStr As String) As String
        Greek2GrEnglish = GreekStr
        For Each Pair As String In GreekPairs
            Dim Parts() As String = Pair.Split(","c)
            Greek2GrEnglish = Greek2GrEnglish.Replace(Parts(0), Parts(1))
        Next
    End Function

    ' The reverse mapping is approximate: several Greek letters share one Latin
    ' letter (e.g. both Η and Ι map to I), so the first pair in the list wins.
    Function GrEnglish2Greek(ByVal GrEnglishStr As String) As String
        GrEnglish2Greek = GrEnglishStr
        For Each Pair As String In GreekPairs
            Dim Parts() As String = Pair.Split(","c)
            GrEnglish2Greek = GrEnglish2Greek.Replace(Parts(1), Parts(0))
        Next
    End Function
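The transliteration above calls `String.Replace` once per pair. As an alternative sketch, the same lookup can be done in a single pass over the input with a dictionary. The names `GreekToLatin` and `TransliterateOnce` are hypothetical, and the dictionary holds only a small subset of the table, just enough for the sample input.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module TransliterateDemo
    ' Subset of the article's Greek-to-Latin table, for illustration only.
    Private ReadOnly GreekToLatin As New Dictionary(Of Char, String) From {
        {"Α"c, "A"}, {"Θ"c, "TH"}, {"Η"c, "I"}, {"Ν"c, "N"}
    }

    ' Walks the input once and appends either the Latin equivalent
    ' or, when the character is not in the table, the character itself.
    Function TransliterateOnce(ByVal input As String) As String
        If String.IsNullOrEmpty(input) Then Return String.Empty
        Dim sb As New StringBuilder(input.Length)
        For Each ch As Char In input
            Dim latin As String = Nothing
            If GreekToLatin.TryGetValue(ch, latin) Then
                sb.Append(latin)
            Else
                sb.Append(ch)
            End If
        Next
        Return sb.ToString()
    End Function

    Sub Main()
        Console.WriteLine(TransliterateOnce("ΑΘΗΝΑ.pdf"))   ' ATHINA.pdf
    End Sub
End Module
```

A single pass also means that text produced by one replacement is never touched by a later one, which matters most for the reverse (Latin to Greek) direction.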
URLSafe function
This function is a stricter version of the WebSafe function. I use it when I need to upload files to RackspaceCloud’s CDN platform, Cloud Files. Instead of removing invalid characters, it replaces them with the “-” character. This is also useful for creating URL titles for new posts in a CMS.

    Function UrlSafe(ByVal Key As String) As String
        ' Transliterate Greek, then replace everything except letters, digits and underscores.
        UrlSafe = Regex.Replace(Greek2GrEnglish(Key.ToLower()), "([^_0-9a-zA-Z])", "-")
        ' Collapse runs of dashes into a single dash.
        While (UrlSafe.Contains("--"))
            UrlSafe = UrlSafe.Replace("--", "-")
        End While
        ' Remove leading and trailing dashes.
        UrlSafe = UrlSafe.Trim("-"c)
    End Function
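For the CMS use case, here is a minimal sketch of how a post title turns into a URL title. `UrlSafeLatin` is a hypothetical name: it leaves out the Greek transliteration step so that the sample runs on its own, and it uses a `+` quantifier so that one regular expression replaces a whole run of invalid characters, which makes the loop unnecessary.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module UrlSafeDemo
    ' Hypothetical Latin-only variant of UrlSafe (no Greek transliteration).
    Function UrlSafeLatin(ByVal key As String) As String
        If String.IsNullOrEmpty(key) Then Return String.Empty
        ' Each run of characters other than a-z, 0-9 and "_" becomes one dash.
        Dim slug As String = Regex.Replace(key.ToLowerInvariant(), "[^_0-9a-z]+", "-")
        ' Remove leading and trailing dashes.
        Return slug.Trim("-"c)
    End Function

    Sub Main()
        Console.WriteLine(UrlSafeLatin("Hello, World! 10 Tips"))   ' hello-world-10-tips
        Console.WriteLine(UrlSafeLatin("  --Draft__v2-- "))        ' draft__v2
    End Sub
End Module
```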
CutText function
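This function cuts a long text at a character limit, but extends the cut to the next space so that the last word is not cut in half. The result can therefore be slightly longer than the limit.

    Function CutText(ByVal SomeText As String, Optional ByVal Length As Integer = 200) As String
        SomeText = Functions.StripAllTags(SomeText)
        ' A limit of 0 means "no limit".
        If (SomeText.Length <= Length OrElse Length = 0) Then Return SomeText
        ' Extend the cut to the next space so that the last word stays whole.
        Dim NextSpace As Integer = SomeText.IndexOf(" "c, Length)
        ' No space after the limit: the remainder is a single word, so keep it.
        If (NextSpace < 0) Then Return SomeText
        Return SomeText.Substring(0, NextSpace)
    End Function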
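If the result must never exceed the limit (for example, when it goes into a fixed-size database column), the cut can be moved backwards to the previous space instead. This is a sketch of that alternative, not the author's function: `CutTextWithin` is a hypothetical name, and it assumes the text has already been stripped of tags.

```vbnet
Imports System

Module CutTextDemo
    ' Hypothetical variant: cut at the last space at or before the limit,
    ' so the result is never longer than maxLength.
    Function CutTextWithin(ByVal someText As String, ByVal maxLength As Integer) As String
        If String.IsNullOrEmpty(someText) Then Return String.Empty
        ' As in CutText, a limit of 0 means "no limit".
        If maxLength <= 0 OrElse someText.Length <= maxLength Then Return someText
        ' Search backwards, starting at the character just after the limit.
        Dim cut As Integer = someText.LastIndexOf(" "c, maxLength)
        ' No usable space found: fall back to a hard cut at the limit.
        If cut <= 0 Then Return someText.Substring(0, maxLength)
        Return someText.Substring(0, cut)
    End Function

    Sub Main()
        Dim text As String = "The quick brown fox jumps over the lazy dog"
        Console.WriteLine(CutTextWithin(text, 12))    ' The quick
        Console.WriteLine(CutTextWithin(text, 15))    ' The quick brown
        Console.WriteLine(CutTextWithin(text, 100))   ' the whole sentence, unchanged
    End Sub
End Module
```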
GetPathFromImageTag function
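This function takes an <img> HTML tag and returns the path from its src attribute.

    Function GetPathFromImageTag(ByVal ImageTag As String) As String
        If (Not ImageTag.Contains("src=""")) Then Return String.Empty
        ' Skip past src=" (5 characters) and read up to the closing quote.
        GetPathFromImageTag = ImageTag.Substring(ImageTag.IndexOf("src=""") + 5)
        Dim EndQuote As Integer = GetPathFromImageTag.IndexOf("""")
        ' No closing quote: the tag is malformed, so return nothing.
        If (EndQuote < 0) Then Return String.Empty
        GetPathFromImageTag = GetPathFromImageTag.Substring(0, EndQuote)
    End Function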
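The function above expects the attribute to be written exactly as src=" with a double quote. Pasted or hand-written HTML does not always follow that form, so a regular expression can be a little more forgiving. The sketch below is an assumption-laden alternative (`GetImageSrc` is a hypothetical name): it accepts single or double quotes, upper-case tags, and spaces around the equals sign.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ImageSrcDemo
    ' Hypothetical, more tolerant variant of GetPathFromImageTag.
    Function GetImageSrc(ByVal html As String) As String
        If String.IsNullOrEmpty(html) Then Return String.Empty
        ' Group 1 captures the opening quote, group 2 the path; \1 requires
        ' the same quote character to close the attribute value.
        Dim m As Match = Regex.Match(html,
            "<img\b[^>]*?\ssrc\s*=\s*([""'])(.*?)\1",
            RegexOptions.IgnoreCase)
        If m.Success Then Return m.Groups(2).Value
        Return String.Empty
    End Function

    Sub Main()
        Console.WriteLine(GetImageSrc("<img alt='x' src='/images/a.png'>"))   ' /images/a.png
        Console.WriteLine(GetImageSrc("<IMG SRC=""b.jpg"" />"))                ' b.jpg
        Console.WriteLine(GetImageSrc("<p>no image here</p>") = "")           ' True
    End Sub
End Module
```

Requiring whitespace before `src` keeps attributes such as `data-src` from being picked up by mistake.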
All functions are simple and provide basic functionality. I hope that you will find them useful!