Coding Still

Clearing HTML tags and MS Word tags from text

August 28, 2012 By _tasos

Web developers often need to extract the plain text from HTML. Users also paste text from MS Word, which brings its own markup along.

In this article I will post a few simple functions that help me manipulate text containing HTML and MS Word tags in various ways.

All functions are really small, and most of them rely on regular expressions.

Clear HTML Tags

This function replaces every tag with the ReplaceChar string. The default is a space, because extra spaces cause no problems when the cleaned text is shown on a web page.

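' Requires: Imports System.Text.RegularExpressions
Function StripHTMLTags(ByVal HTMLToStrip As String, Optional ByVal ReplaceChar As String = " ") As String
    If (HTMLToStrip <> String.Empty) Then
        ' Lazy match from "<" to the nearest ">"; Singleline lets a tag span several lines
        Return Regex.Replace(HTMLToStrip, "<.*?>", ReplaceChar, RegexOptions.Singleline)
    End If
    Return String.Empty
End Function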
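
As a quick usage sketch, the snippet below strips the tags from a small fragment and then collapses the leftover spaces. The StripHtmlDemo module and the whitespace-collapsing step are additions for illustration only; they are not part of the original function.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module StripHtmlDemo
    ' Copy of the function above so the example runs on its own
    Function StripHTMLTags(ByVal HTMLToStrip As String, Optional ByVal ReplaceChar As String = " ") As String
        If (HTMLToStrip <> String.Empty) Then
            Return Regex.Replace(HTMLToStrip, "<.*?>", ReplaceChar)
        End If
        Return String.Empty
    End Function

    Sub Main()
        Dim Html As String = "<p>Hello <b>world</b>!</p>"

        ' Every tag becomes a space: " Hello  world ! "
        Dim Text As String = StripHTMLTags(Html)

        ' Assumed extra step: collapse runs of whitespace and trim the ends
        Text = Regex.Replace(Text, "\s+", " ").Trim()

        Console.WriteLine(Text) ' prints: Hello world !
    End Sub
End Module
```

Note the space left before the exclamation mark: the closing </b> tag was replaced by a space, which is harmless in a browser but visible in plain text.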

Clear MS Word Tags

The function above cannot clear all the markup that MS Word adds. The function below does the trick. Combining both functions gives you plain text.

Function StripWordTags(ByVal HTMLToStrip As String, Optional ByVal ReplaceChar As String = " ") As String
    StripWordTags = String.Empty
    If (HTMLToStrip <> String.Empty) Then
        ' Flatten line breaks first so that every comment sits on a single line
        StripWordTags = Regex.Replace(HTMLToStrip, "\n", ReplaceChar)
        StripWordTags = Regex.Replace(StripWordTags, "\r", ReplaceChar)
        ' Remove comments, both literal and HTML-encoded
        StripWordTags = Regex.Replace(StripWordTags, "<!--.+?-->", ReplaceChar)
        StripWordTags = Regex.Replace(StripWordTags, "&lt;!--.+?--&gt;", ReplaceChar)
        ' The start of this pattern is missing in the source ("]*>" is all that survives);
        ' "<[^>]*>" (any remaining tag) is an assumed reconstruction
        StripWordTags = Regex.Replace(StripWordTags, "<[^>]*>", ReplaceChar)
    End If
End Function
 
Function StripAllTags(ByVal HTMLToStrip As String, Optional ByVal ReplaceChar As String = " ") As String
    ' Pass ReplaceChar through so that both steps honour the caller's choice
    StripAllTags = StripWordTags(HTMLToStrip, ReplaceChar)
    StripAllTags = StripHTMLTags(StripAllTags, ReplaceChar)
End Function
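
The comment-removal step can also be sketched without flattening the line breaks first: RegexOptions.Singleline lets the dot match newlines, so a comment that spans several lines is still caught. StripHtmlComments and the WordCommentDemo module are hypothetical names used only for this illustration, and the sample input is a made-up Word-style conditional comment.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module WordCommentDemo
    ' Hypothetical helper (not from the article): removes HTML comments,
    ' including ones that span several lines
    Function StripHtmlComments(ByVal Html As String, Optional ByVal ReplaceChar As String = " ") As String
        If String.IsNullOrEmpty(Html) Then Return String.Empty
        Return Regex.Replace(Html, "<!--.*?-->", ReplaceChar, RegexOptions.Singleline)
    End Function

    Sub Main()
        ' A conditional comment of the kind MS Word pastes, split over three lines
        Dim Pasted As String = "Before<!--[if gte mso 9]>" & vbCrLf &
                               "<xml>Word settings</xml>" & vbCrLf &
                               "<![endif]-->After"

        Console.WriteLine(StripHtmlComments(Pasted, "")) ' prints: BeforeAfter
    End Sub
End Module
```

This keeps the line breaks outside comments intact, which may matter if the cleaned text is later shown in a plain-text context.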

String To URL

This function checks whether a string starts with http:// or https:// and, if it does not, prepends http://. It also appends a trailing slash when one is missing. This comes in handy for fixing user input when the value is a website or domain.

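Function Str2Url(ByVal Str As String) As String
    Str2Url = Str
    If (Str2Url <> "") Then
        ' Add a scheme when the input has none (the check ignores case)
        If (Not Str2Url.StartsWith("http://", StringComparison.OrdinalIgnoreCase) AndAlso Not Str2Url.StartsWith("https://", StringComparison.OrdinalIgnoreCase)) Then
            Str2Url = "http://" & Str2Url
        End If
        ' Make sure the result ends with a slash
        If (Not Str2Url.EndsWith("/")) Then
            Str2Url = Str2Url & "/"
        End If
    End If
End Function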
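
If the input also needs to be validated, one possible approach is to hand the candidate to System.Uri after the scheme has been added. The sketch below assumes that is acceptable for your input; TryNormalizeUrl and the UrlDemo module are hypothetical names, not part of the original code.

```vbnet
Imports System

Module UrlDemo
    ' Hypothetical helper: prepends http:// when no scheme is present and
    ' lets Uri.TryCreate decide whether the result is a well-formed absolute URL
    Function TryNormalizeUrl(ByVal Input As String, ByRef Result As String) As Boolean
        Result = String.Empty
        If String.IsNullOrWhiteSpace(Input) Then Return False

        Dim Candidate As String = Input.Trim()
        If Not Candidate.StartsWith("http://", StringComparison.OrdinalIgnoreCase) AndAlso
           Not Candidate.StartsWith("https://", StringComparison.OrdinalIgnoreCase) Then
            Candidate = "http://" & Candidate
        End If

        Dim Parsed As Uri = Nothing
        If Not Uri.TryCreate(Candidate, UriKind.Absolute, Parsed) Then Return False

        ' AbsoluteUri adds the trailing slash for a bare host name
        Result = Parsed.AbsoluteUri
        Return True
    End Function

    Sub Main()
        Dim Url As String = String.Empty
        If TryNormalizeUrl("example.com", Url) Then
            Console.WriteLine(Url) ' prints: http://example.com/
        End If
    End Sub
End Module
```

Unlike Str2Url, this variant does not force a trailing slash onto paths such as /page.html; whether that is desirable depends on how the value is used.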

Websafe function

This function removes from a string all characters that could cause a problem in a URL, e.g. when a file is uploaded and its original file name contains characters that would break a URL.
Also, some web servers cannot correctly handle file names with non-Latin characters. I have a function that converts Greek file names to an equivalent name with Latin characters (and vice versa).

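Function WebSafe(ByVal Key As String) As String
    ' Transliterate Greek letters and lower-case the result
    WebSafe = Greek2GrEnglish(Key).ToLowerInvariant()
    ' Keep only letters, digits, "." and "_" (inside [...] the characters "|" and "^" are literals, not operators)
    WebSafe = Regex.Replace(WebSafe, "[^0-9a-z._]", "")
    ' Collapse runs of dots into a single dot
    WebSafe = Regex.Replace(WebSafe, "\.+", ".")
End Function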
 
Function Greek2GrEnglish(ByVal GreekStr As String) As String
    Dim AllMatches() As String = "Α,A:Β,B:Γ,G:Δ,D:Ε,E:Ζ,Z:Η,I:Θ,TH:Ι,I:Κ,K:Λ,L:Μ,M:Ν,N:Ξ,KS:Ο,O:Π,P:Ρ,R:Σ,S:Τ,T:Υ,Y:Φ,F:Χ,X:Ψ,PS:Ω,W:Ά,A:Έ,E:Ή,I:Ί,I:Ό,O:Ύ,Y:Ώ,W:Ϊ,I:Ϋ,Y:ά,a:έ,e:ή,i:ί,i:ΰ,y:α,a:β,b:γ,g:δ,d:ε,e:ζ,z:η,i:θ,th:ι,i:κ,k:λ,l:μ,m:ν,n:ξ,ks:ο,o:π,p:ρ,r:ς,s:σ,s:τ,t:υ,y:φ,f:χ,x:ψ,ps:ω,w:ϊ,i:ϋ,y:ό,o:ύ,y:ώ,o:ΐ,i".Split(":")
    Greek2GrEnglish = GreekStr
 
    ' Each entry has the form "Greek,Latin": replace every Greek letter with its Latin equivalent
    For Each _Match As String In AllMatches
        Dim Pair() As String = _Match.Split(","c)
        Greek2GrEnglish = Greek2GrEnglish.Replace(Pair(0), Pair(1))
    Next
End Function
 
Function GrEnglish2Greek(ByVal GrEnglishStr As String) As String
    Dim AllMatches() As String = "Α,A:Β,B:Γ,G:Δ,D:Ε,E:Ζ,Z:Η,I:Θ,TH:Ι,I:Κ,K:Λ,L:Μ,M:Ν,N:Ξ,KS:Ο,O:Π,P:Ρ,R:Σ,S:Τ,T:Υ,Y:Φ,F:Χ,X:Ψ,PS:Ω,W:Ά,A:Έ,E:Ή,I:Ί,I:Ό,O:Ύ,Y:Ώ,W:Ϊ,I:Ϋ,Y:ά,a:έ,e:ή,i:ί,i:ΰ,y:α,a:β,b:γ,g:δ,d:ε,e:ζ,z:η,i:θ,th:ι,i:κ,k:λ,l:μ,m:ν,n:ξ,ks:ο,o:π,p:ρ,r:ς,s:σ,s:τ,t:υ,y:φ,f:χ,x:ψ,ps:ω,w:ϊ,i:ϋ,y:ό,o:ύ,y:ώ,o:ΐ,i".Split(":")
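    GrEnglish2Greek = GrEnglishStr
 
    ' Handle the two-letter Latin sequences (TH, KS, PS) first; otherwise K and P are
    ' replaced before KS and PS are ever seen (OrderByDescending needs Imports System.Linq).
    ' Several Greek letters share one Latin letter, so this reverse conversion is only approximate.
    For Each _Match As String In AllMatches.OrderByDescending(Function(m) m.Split(","c)(1).Length)
        Dim Pair() As String = _Match.Split(","c)
        GrEnglish2Greek = GrEnglish2Greek.Replace(Pair(1), Pair(0))
    Next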
End Function
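
The repeated Replace calls walk the whole string once per table entry. A single-pass variant can be sketched with a dictionary lookup per character. The table below is a deliberately tiny sample (the full mapping is the one listed in Greek2GrEnglish), and Transliterate and TransliterationDemo are hypothetical names used only here.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Module TransliterationDemo
    ' Sample subset of the Greek-to-Latin table, just enough for the example below
    Private ReadOnly Map As New Dictionary(Of Char, String) From {
        {"Α"c, "A"}, {"Θ"c, "TH"},
        {"α"c, "a"}, {"ά"c, "a"}, {"θ"c, "th"},
        {"η"c, "i"}, {"ή"c, "i"}, {"ι"c, "i"}, {"ν"c, "n"}
    }

    ' Hypothetical single-pass transliteration: one dictionary lookup per character
    Function Transliterate(ByVal Text As String) As String
        Dim Sb As New StringBuilder(Text.Length)
        For Each C As Char In Text
            Dim Latin As String = Nothing
            If Map.TryGetValue(C, Latin) Then
                Sb.Append(Latin)
            Else
                ' Characters without a mapping are copied unchanged
                Sb.Append(C)
            End If
        Next
        Return Sb.ToString()
    End Function

    Sub Main()
        Console.WriteLine(Transliterate("Αθήνα.jpg")) ' prints: Athina.jpg
    End Sub
End Module
```

For short file names the difference is negligible; the lookup version mainly makes the table easier to extend and keeps the work proportional to the length of the input.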

URLSafe function

This function is a stricter version of the WebSafe function. I use it when I need to upload files to RackspaceCloud’s CDN platform, Cloud Files. It replaces every disallowed character with the “-” character, which also makes it useful for creating URL titles for new posts in a CMS.

Function UrlSafe(ByVal Key As String) As String
    ' Replace every run of characters other than letters, digits and "_" with a single "-"
    UrlSafe = Regex.Replace(Greek2GrEnglish(Key.ToLower()), "[^_0-9a-zA-Z]+", "-")
    ' Drop any leading or trailing "-"
    UrlSafe = UrlSafe.Trim("-"c)
End Function
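
To show the kind of output this produces, here is a trimmed-down sketch for Latin-only input. It leaves out the Greek transliteration step, and Slugify and SlugDemo are hypothetical names that do not appear in the original code.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SlugDemo
    ' Hypothetical Latin-only variant of UrlSafe (no Greek2GrEnglish call)
    Function Slugify(ByVal Title As String) As String
        ' Each run of disallowed characters becomes one "-"
        Dim Slug As String = Regex.Replace(Title.ToLowerInvariant(), "[^_0-9a-z]+", "-")
        ' Remove a leading or trailing "-" left by punctuation at the ends
        Return Slug.Trim("-"c)
    End Function

    Sub Main()
        Console.WriteLine(Slugify("Clearing HTML tags & MS Word tags!"))
        ' prints: clearing-html-tags-ms-word-tags
    End Sub
End Module
```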

CutText function

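This function strips all tags and then cuts a long text down to a character limit, while making sure that the last word is not cut in half. The cut is made at the first space after the limit, so the result can be slightly longer than the limit.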

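Function CutText(ByVal SomeText As String, Optional ByVal Length As Integer = 200) As String
    SomeText = Functions.StripAllTags(SomeText)
 
    ' Length = 0 means "no limit"
    If (Length = 0 OrElse SomeText.Length <= Length) Then
        Return SomeText
    End If
 
    ' Cut at the first space at or after the limit so that the last word stays whole
    Dim NextSpace As Integer = SomeText.IndexOf(" "c, Length)
    ' No space left: the text ends inside the last word, so return it whole
    If (NextSpace = -1) Then Return SomeText
    Return SomeText.Substring(0, NextSpace)
End Function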
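
If the result must never exceed the limit, the cut can be made at the last space before the limit instead. The following is a sketch of that alternative, not the original behaviour; CutTextWithin and CutTextDemo are hypothetical names, and tag stripping is left out to keep the example self-contained.

```vbnet
Imports System

Module CutTextDemo
    ' Hypothetical variant: the result is never longer than Length characters
    Function CutTextWithin(ByVal SomeText As String, ByVal Length As Integer) As String
        If Length <= 0 OrElse SomeText.Length <= Length Then Return SomeText

        ' Search backwards from index Length; a space exactly at that index
        ' means the first Length characters already end on a word boundary
        Dim LastSpace As Integer = SomeText.LastIndexOf(" "c, Length)

        ' No usable space: fall back to a hard cut at the limit
        If LastSpace <= 0 Then Return SomeText.Substring(0, Length)

        Return SomeText.Substring(0, LastSpace)
    End Function

    Sub Main()
        Console.WriteLine(CutTextWithin("The quick brown fox jumps", 12)) ' prints: The quick
    End Sub
End Module
```

With a limit of 12 the original approach would return "The quick brown" (15 characters), while this variant stops at "The quick".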

GetPathFromImageTag function

This function takes an <img> HTML tag and returns the path from its src attribute.

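Function GetPathFromImageTag(ByVal ImageTag As String) As String
    ' Only double-quoted src attributes are handled
    If (Not ImageTag.Contains("src=""")) Then Return String.Empty
 
    ' Skip past src=" (5 characters), then read up to the closing quote
    GetPathFromImageTag = ImageTag.Substring(ImageTag.IndexOf("src=""") + 5)
    Dim ClosingQuote As Integer = GetPathFromImageTag.IndexOf("""")
    If (ClosingQuote = -1) Then Return String.Empty
    GetPathFromImageTag = GetPathFromImageTag.Substring(0, ClosingQuote)
End Function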
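
Since most of the other helpers use regular expressions, the same lookup can be sketched with one as well. This version also accepts single quotes and ignores attributes such as data-src. GetSrc and ImageSrcDemo are hypothetical names, and the pattern is only a sketch for simple, well-formed tags, not a general HTML parser.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ImageSrcDemo
    ' Hypothetical regex-based variant: group 1 is the opening quote,
    ' group 2 the attribute value, and \1 requires the same closing quote
    Function GetSrc(ByVal ImageTag As String) As String
        Dim M As Match = Regex.Match(ImageTag, "(?<![\w-])src\s*=\s*([""'])(.*?)\1", RegexOptions.IgnoreCase)
        If M.Success Then Return M.Groups(2).Value
        Return String.Empty
    End Function

    Sub Main()
        Console.WriteLine(GetSrc("<img alt='logo' src='/images/logo.png' />"))
        ' prints: /images/logo.png
    End Sub
End Module
```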

All functions are simple and provide basic functionality. I hope that you will find them useful!

Filed Under: .NET Development, ASP.NET Tagged With: Regular expressions